library(mosaic)
library(mosaicData)
library(tidyverse)
library(car)
library(ggplot2)
library(plotly)
library(datasets)
library(DT)
library(knitr)
library(pander)
library(alr4)
library(dplyr)
library(MASS)
Q: Perform a regression using the airquality data set in R where the response variable is Ozone and the explanatory variable is Wind.
What is the value of the MSE of this regression rounded to the nearest whole number?
A: 701
Squaring the residual standard error of 26.47 gives 700.6609 for the MSE. Or, computing the MSE directly gives
sum(air.lm$res^2)/114
In either case the result is MSE = 701
air.lm <- lm(Ozone ~ Wind, data=airquality)
summary(air.lm)
##
## Call:
## lm(formula = Ozone ~ Wind, data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.572 -18.854 -4.868 15.234 90.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 96.8729 7.2387 13.38 < 2e-16 ***
## Wind -5.5509 0.6904 -8.04 9.27e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.47 on 114 degrees of freedom
## (37 observations deleted due to missingness)
## Multiple R-squared: 0.3619, Adjusted R-squared: 0.3563
## F-statistic: 64.64 on 1 and 114 DF, p-value: 9.272e-13
sum(air.lm$res^2)/114
## [1] 700.5177
Q: Use the RailTrail data set in R from library(mosaic) to fit a regression to the following scatterplot. Compute the value of the residual for the solid dot labeled in the plot.
A: 0.17157
Run this code:
View(RailTrail)
rail.lm <- lm(hightemp ~ lowtemp, data=RailTrail)
rail.lm$residuals[25]
predict(rail.lm, data.frame(lowtemp=c(35)))
To find the residual, use:
\[r_i = Y_i - \hat{Y}_i = 61 - 60.82843 = 0.17157\]
rt.lm <- lm(hightemp ~ lowtemp, data = RailTrail)
plot(hightemp ~ lowtemp, data = RailTrail)
abline(rt.lm)
predict(rt.lm, data.frame(lowtemp = 35))
## 1
## 60.82843
61 - 60.82843
## [1] 0.17157
Q: Each of the following equations comes from student submissions of the Predicting the Weather Analysis. Most of them have mistakes.
Which equation does not have any mistakes in it?
A: \(E\{\underbrace{Y_i}_\text{High Temp}\} = \beta_0 + \beta_1 \underbrace{X_i}_\text{Complicated X}\)
In our course, the three symbols of \(Y_i, \ \hat{Y}_i, \ \text{and} \ E\{Y_i\}\) all represent different things. So when an equation uses the \(Y_i\) it should be of the form:
\(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \quad \text{where} \ \epsilon_i \sim N(0,\sigma^2)\)
When an equation uses the predicted value \(\hat{Y}_i\), it should be of the form:
\(\hat{Y}_i = b_0 + b_1 X_{i}\)
and usually we replace the \(b_0\) and \(b_1\) values with their numeric values from the “Estimate” column of the regression summary output in RStudio.
Finally, when an equation uses the \(E\{Y_i\}\) it is denoting the true regression relation and it should be of the form: \(E\{Y_i\} = \beta_0 + \beta_1 X_{i}\)
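As a concrete sketch of the distinction (using the cars data set, which appears in later questions), the first two quantities can be computed in R, while the third never can be:

```r
mylm <- lm(dist ~ speed, data = cars)

cars$dist[1]      # Y_1: the observed value for the first individual

fitted(mylm)[1]   # Y-hat_1 = b_0 + b_1 * X_1: the estimated value

# E{Y_1} = beta_0 + beta_1 * X_1 uses the true (unknown) parameters,
# so it cannot be computed from data; it is only estimated by Y-hat_1.
```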
Q: What is the value of the MSE for the simple linear regression depicted in the plot? - You shouldn’t be able to tell by just looking at it. Do some calculations. The residual for each point has been labeled for you.
A: 118.4
As the residuals are all depicted in the graph, we can calculate the SSE using:
sum( c(-6.6,13.8,-9.8,4.6,-2)^2 )
which equals 355.2.
Since the line is a “simple linear regression” line, it must use two parameters, so p = 2. By visually counting, we see there are 5 dots in the plot, so n = 5.
MSE = SSE/(n-p) = 355.2 / (5 - 2) = 118.4
res <- c(-6.6, 13.8, -9.8, 4.6, -2)
SSE <- sum(res^2)
n <- length(res)
MSE <- SSE / (n - 2)
print(MSE)
## [1] 118.4
Q: For the Residuals vs Fitted plot shown, what is the value of \(Y_i\) for the solid orange dot?
A: 45
The solid orange dot has a fitted value of 60, read from the horizontal axis of the plot. The vertical axis shows the residuals, which gives the point a residual of -15, read from the y-axis. Using the equation of a residual
\[ r_i = Y_i - \hat{Y}_i \\ Y_i = r_i + \hat{Y}_i \ \text{(solve for Y)} \]
shows that the y-value can be found by adding the value of the residual to the fitted value.
That gives a guess of 60 - 15 = 45 for the value of \(Y_i\) for the solid orange dot.
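A quick check of that arithmetic in R (the values -15 and 60 are read off the plot, so they are approximate):

```r
r_i   <- -15   # residual, read from the vertical axis
y_hat <- 60    # fitted value, read from the horizontal axis

Y_i <- y_hat + r_i   # since Y_i = Y-hat_i + r_i
Y_i
## [1] 45
```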
Q: Suppose a researcher developed the theory that each one degree increase in daily average temperatures reduced average daily wind speed by 1/6 a mile per hour.
Use the airquality data to test the following hypothesis.
\[\underbrace{Y_i}_\text{Wind} = \beta_0 + \beta_1 \underbrace{X_i}_\text{Temp} + \epsilon_i \text{ where } \epsilon_i \sim N(0,\sigma^2)\]
\[H_0: \beta_1 = -0.167 \\ H_a: \beta_1 \neq -0.167\]
Report the value of the p-value of the test.
A: p-value = 0.898
wind.lm <- lm(Wind ~ Temp, data = airquality)
summary(wind.lm)
##
## Call:
## lm(formula = Wind ~ Temp, data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5784 -2.4489 -0.2261 1.9853 9.7398
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.23369 2.11239 10.999 < 2e-16 ***
## Temp -0.17046 0.02693 -6.331 2.64e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.142 on 151 degrees of freedom
## Multiple R-squared: 0.2098, Adjusted R-squared: 0.2045
## F-statistic: 40.08 on 1 and 151 DF, p-value: 2.642e-09
twind <- (-0.17046 - (-0.167)) / 0.02693
2*pt(-abs(twind), 151)
## [1] 0.8979391
Q: Continue using the regression model from the previous question on the airquality data set:
\[\underbrace{Y_i}_\text{Wind} = \beta_0 + \beta_1 \underbrace{X_i}_\text{Temp} + \epsilon_i \text{ where } \epsilon_i \sim N(0,\sigma^2)\]
Suppose the impossible happened and May 1st of a certain year ended up being 0 degrees Fahrenheit in New York City.
What daily average wind speed (in miles per hour) does your regression model expect for such a day?
A: 23 mph
wind.lm <- lm(Wind ~ Temp, data=airquality)
coef(wind.lm)
## (Intercept) Temp
## 23.2336881 -0.1704644
Q: A regression is performed, and the following output obtained. Use this output to create a 95% confidence interval for the true y-intercept (\(\beta_0\)) corresponding to this regression.
A: (1.2711, 2.6979)
The interval is the estimate plus and minus the appropriate t critical value times the standard error, using the degrees of freedom from the output:
1.9845 + qt(c(0.025, 0.975), 37)*0.3521
## [1] 1.271078 2.697922
Q: Use the mtcars data set in R to perform a log(Y) transformation regression of mpg predicted by wt.
Use your transformation regression to predict the gas mileage (mpg) of a vehicle that has a weight of 2,000 lbs.
Be sure to take note of the ?mtcars help file and see the wt variable for details on how this variable is recorded.
A: 26.79846
lm1 <- lm(mpg ~ wt, data=mtcars)
boxCox(lm1)
#Suggests a 0 value is best, or log(Y) transformation.
lm2 <- lm(log(mpg) ~ wt, data=mtcars)
exp(predict(lm2, data.frame(wt=2))) #prediction transformed back
## 1
## 26.79846
Q: Run these codes in R:
library(mosaic) #contains the Weather data
plot(high_temp ~ high_dewpt, data=Weather)
Fit the following regression model on the data.
\[\underbrace{Y_i}_\text{High Temp} = \beta_0 + \beta_1 \underbrace{X_i}_\text{High Dew Point} + \epsilon_i \text{ where } \epsilon_i \sim N(0,\sigma^2)\]
Select the plot below that correctly depicts (for the above model) the 95% prediction interval for the high temperature when the high dew point is 80.
A:
# actual
plot(high_temp ~ high_dewpt, data=Weather, ylim=c(0,120))
temp_lm <- lm(high_temp ~ high_dewpt, data=Weather)
abline(temp_lm)
predict(temp_lm, data.frame(high_dewpt=80), interval="prediction")
## fit lwr upr
## 1 91.52634 74.49373 108.559
lines(c(80,80), c(74.49373, 108.559), lwd=6, col=rgb(0.2,.8,.5,.5))
lines(c(-21,80,80,-21), c(74.49373, 74.49373, 108.559, 108.559), lty=2, col="gray")
Q: Suppose you are asked to help decide how wide (in centimeters) kids' shoes should be for fourth-grade girls with foot lengths of 22 centimeters, using the KidsFeet dataset, which contains data about fourth-grade girls and boys.
Perform a meaningful regression on the girls' data provided in the KidsFeet data set. Select the interval below that, according to your regression, we are 95% confident contains 95% of the actual widths of fourth-grade girls' feet that are 22 centimeters long.
A: We predict girls' feet to be anywhere from 7.32 cm to 9.19 cm wide when their foot length is 22 cm.
girlsFeet <- filter(KidsFeet, sex=="G")
plot(width ~ length, data=girlsFeet)
girls.lm <- lm(width ~ length, data=girlsFeet)
predict(girls.lm, data.frame(length=22), interval="prediction")
## fit lwr upr
## 1 8.253979 7.320637 9.187321
Q: The solid orange dot shown in the scatterplot below corresponds with which point on the residuals vs. fitted plot?
A: The point labeled “Chrysler Imperial”
plot(mpg ~ wt, data=mtcars, main="mtcars data set")
points(mpg ~ wt, data=mtcars[17,], pch=16, col="orange", cex=1.2)
mt.lm <- lm(mpg ~ wt, data=mtcars)
abline(mt.lm)
plot(mt.lm, which=1)
points(mt.lm$res[17] ~ mt.lm$fit[17], pch=16, cex=1.2, col="orange")
Q: Consider the following scatterplot. Apply an appropriate Y transformation to the data shown in the plot.
State the R-squared value of the regression on the transformed data.
plot(Ozone ~ Temp, data=airquality)
Or, if you prefer, here is the plot in ggplot2:
ggplot(airquality, aes(x=Temp, y=Ozone)) + geom_point()
A: 0.5632
ggplot(airquality, aes(x=Temp, y=Ozone)) +
geom_point()
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).
air3.lm <- lm(Ozone ~ Temp, data=airquality)
boxCox(air3.lm)
air3ss.lm <- lm(sqrt(sqrt(Ozone)) ~ Temp, data=airquality)
summary(air3ss.lm)
##
## Call:
## lm(formula = sqrt(sqrt(Ozone)) ~ Temp, data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.86085 -0.23007 0.00676 0.21087 1.07342
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.661530 0.254646 -2.598 0.0106 *
## Temp 0.039362 0.003246 12.125 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3302 on 114 degrees of freedom
## (37 observations deleted due to missingness)
## Multiple R-squared: 0.5632, Adjusted R-squared: 0.5594
## F-statistic: 147 on 1 and 114 DF, p-value: < 2.2e-16
air3log.lm <- lm(log(Ozone) ~ Temp, data=airquality)
summary(air3log.lm)
##
## Call:
## lm(formula = log(Ozone) ~ Temp, data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.14469 -0.33095 0.02961 0.36507 1.49421
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.83797 0.45100 -4.075 8.53e-05 ***
## Temp 0.06750 0.00575 11.741 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5848 on 114 degrees of freedom
## (37 observations deleted due to missingness)
## Multiple R-squared: 0.5473, Adjusted R-squared: 0.5434
## F-statistic: 137.8 on 1 and 114 DF, p-value: < 2.2e-16
# saunders code
lm1 <- lm(Ozone ~ Temp, data=airquality)
boxCox(lm1)
# The Box-Cox results show a transformation should be applied: lambda = 0.25, i.e., a fourth-root transformation.
air.lm.t <- lm(sqrt(sqrt(Ozone)) ~ Temp, data=airquality)
summary(air.lm.t)
##
## Call:
## lm(formula = sqrt(sqrt(Ozone)) ~ Temp, data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.86085 -0.23007 0.00676 0.21087 1.07342
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.661530 0.254646 -2.598 0.0106 *
## Temp 0.039362 0.003246 12.125 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3302 on 114 degrees of freedom
## (37 observations deleted due to missingness)
## Multiple R-squared: 0.5632, Adjusted R-squared: 0.5594
## F-statistic: 147 on 1 and 114 DF, p-value: < 2.2e-16
#Gives an R-squared of 0.5632
Q: Open the ChickWeight data set in R.
Perform a linear regression that predicts the weight of chickens according to how old the chick is in days.
Which of the following correctly diagnoses the appropriateness of this regression?
A: Constant Variance and normal errors are both violated, but the linear relation isn’t too bad, relatively speaking.
chick.lm <- lm(weight ~ Time, data = ChickWeight)
summary(chick.lm)
##
## Call:
## lm(formula = weight ~ Time, data = ChickWeight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -138.331 -14.536 0.926 13.533 160.669
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.4674 3.0365 9.046 <2e-16 ***
## Time 8.8030 0.2397 36.725 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.91 on 576 degrees of freedom
## Multiple R-squared: 0.7007, Adjusted R-squared: 0.7002
## F-statistic: 1349 on 1 and 576 DF, p-value: < 2.2e-16
par(mfrow=c(1,3))
plot(chick.lm, which=1)
qqPlot(chick.lm$residuals, main="Q-Q Plot", col="steelblue3", col.lines="steelblue4", pch= 19, id=FALSE)
plot(chick.lm$residuals, main = "Residuals vs. Order")
Q: Consider the linear model
\[Y_i = \beta_0 + \beta_1X_i+ \epsilon_i \text{ where } \epsilon_i \sim N(0,\sigma^2)\]
Which of the following is a true statement about this model?
A: \(\beta_0 + \beta_1 X_i\) describes the functional relationship of the data to the true model.
Q: What is the value of SSE according to the output shown below?
A: Value of SSE: 25.2
Using the residual standard error: 1.875^2*22. Squaring the residual standard error gives the MSE. Multiplying the MSE by (n-p), the 22 degrees of freedom, recovers the SSE, because MSE = SSE/(n-p) and the residual standard error equals sqrt(MSE).
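This relationship can be verified on any regression output. Here is a sketch using the cars regression that appears later in this document:

```r
cars.lm <- lm(dist ~ speed, data = cars)
s <- summary(cars.lm)

rse <- s$sigma    # the residual standard error (15.38)
df  <- s$df[2]    # the residual degrees of freedom, n - p (48)

rse^2 * df                 # SSE recovered from the summary output
sum(resid(cars.lm)^2)      # SSE computed directly -- same value
```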
Q: A researcher develops a regression model to predict the highway miles per gallon of a vehicle based on the city miles per gallon of the vehicle. They run the following codes to do this.
?mpg
View(mpg)
plot(hwy ~ cty, data = mpg)
mpg.lm <- lm(hwy ~ cty, data=mpg)
The researcher is wondering how to best interpret the slope from their model.
Select the statement below that best interprets the slope of their model.
mpg.lm <- lm(hwy ~ cty, data=mpg)
summary(mpg.lm)
##
## Call:
## lm(formula = hwy ~ cty, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3408 -1.2790 0.0214 1.0338 4.0461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.89204 0.46895 1.902 0.0584 .
## cty 1.33746 0.02697 49.585 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.752 on 232 degrees of freedom
## Multiple R-squared: 0.9138, Adjusted R-squared: 0.9134
## F-statistic: 2459 on 1 and 232 DF, p-value: < 2.2e-16
plot(hwy ~ cty, data = mpg)
abline(mpg.lm)
A: The average highway gas mileage of vehicles increases by 1.337 miles per gallon for every 1 mpg increase in city gas mileage of the vehicle.
Q: Select the plot below that correctly depicts the least squares regression line for the data shown. - You should be able to visually tell which one is the least squares line, just by looking at the plots.
A: The least squares line should go through the average of the Y-values as you move across the x-values. Only one plot does this visually.
Q: This question goes a little beyond your current knowledge, but uses ideas you should be familiar with in the same way you have been using them previously.
What is the p-value for the missing entry in the output below? Note: the sample size for this regression was n=30.
A: p-value = 2*pt(-abs(estimate / std. error), n - p), where p is the number of parameters estimated by the model, which is 4. (Recognizing that p = 4 is the hardest part.)
tm <- -109.736 / 374.404
print(tm)
## [1] -0.2930952
pt(-abs(tm), 28)*2   # close, but 28 df would assume p = 2; this model has p = 4
## [1] 0.7716107
# correct answer:
pt(-109.736/374.404, 30-4)*2
## [1] 0.771776
Q: Use the Orange dataset in R to perform a linear regression of the circumference (y) of an orange tree according to the age of the tree (x).
Which regression assumption is not satisfied for this regression?
A: The Q-Q Plot looks great. Linearity is okay; even though the red line is a little jagged, the points don’t show any systematic curvature away from the line. Clearly the x-values are all fixed, as they are “replicated” in the study. However, the variance shows a strong increasing trend in the vertical variability of the residuals as we move from left to right in the residuals plot.
plot(circumference ~ age, data=Orange)
lm1 <- lm(circumference ~ age, data=Orange)
plot(lm1, which=1)
qqPlot(lm1$res)
## [1] 21 27
Q: Which of the following best describes SSR?
A: SSR: A measure of how much the regression line differs from the average of all of the y-values.
Q: Consider transforming Y in the regression of stopping distance on speed using the cars dataset in R. Which of the following shows the “best” Y-transformation possible for this data in the original units?
carssd.lm <- lm(dist ~ speed, data = cars)
boxCox(carssd.lm)
lm.log <- lm(log(dist) ~ speed, data = cars)
b.log <- coef(lm.log)
lm.sqrt <-lm(sqrt(dist) ~ speed, data = cars)
b.sqrt <- coef(lm.sqrt)
lm.1oy <- lm(1/dist ~ speed, data = cars)
b.1oy <- coef(lm.1oy)
lm.y <- lm(dist ~ speed, data = cars)
b.y <- coef(lm.y)
lm.2 <- lm(dist^2 ~ speed, data = cars)
b.2 <- coef(lm.2)
lm.ss <- lm(sqrt(sqrt(dist)) ~ speed, data = cars)
b.ss <- coef(lm.ss)
plot(dist ~ speed, data = cars, pch=16, col="orangered", main="cars", xlab="Speed", ylab="dist")
legend("topleft", legend=c("1","2","3","4","5","6"), lty=1, col=c(1,2,3,4,5,6))
curve( (b.sqrt[1] + b.sqrt[2]*x)^2 , add=TRUE, col=2)
Q: Consider the scatterplot shown below. Create this scatterplot in R and add the regression line to the plot.
At which age does the actual average of the dots seem to be furthest from the line?
lob.lm <- lm(height ~ age, data = Loblolly)
plot(height ~ age, data = Loblolly)
abline(lob.lm)
par(mfrow=c(1,3))
plot(lob.lm, which=1)
qqPlot(lob.lm$residuals, main="Q-Q Plot", col="steelblue3", col.lines="steelblue4", pch= 19, id=FALSE)
plot(lob.lm$residuals, main = "Residuals vs. Order")
A: Age 3 - The red line is furthest from the gray line for the first and last groups, ages 3 and 20. Since only age 3 is an option in the list above, it is the best answer.
lob.lm <- lm(height ~ age, data=Loblolly)
plot(lob.lm, which=1)
abline(lob.lm)
Q: Consider the several scatterplots shown below of Y against three different x-variables, and each x-variable against each other. In each case, the vertical axis of the graph is the variable name to the side of the plot, while the horizontal axis is given by the variable name above or below the plot.
Which plot would have the highest R-squared value when using a simple linear regression (no transformations) in each case?
A: The plot of Y ~ X1
Q: Which of the following statements contains an error?
A: \(\underbrace{\hat{Y_i}}_\text{Weight} = \beta_0 + \beta_1 \underbrace{X_i}_\text{height}\)
Explanation: When using \(\hat{Y_i}\) we need to refer to the estimates of the \(\beta_0\) and \(\beta_1\) terms instead of their “true” values. Typically we do this with \(b_0\) and \(b_1\) but the real error is in labeling y-hat with “weight”. It should be labeled as “predicted weight” or “estimated mean weight” or something else that emphasizes that it is the estimate of the “average weight” for every given x-value. It is no longer the weight of an individual, that is what \(Y_i\) represents.
Statements with NO errors!
\[E\{\underbrace{Y_i}_\text{weight}\} = \beta_0 + \beta_1 \underbrace{X_i}_\text{height}\]
\[\underbrace{Y_i}_\text{weight} = \beta_0 + \beta_1 \underbrace{X_i}_\text{height} + \epsilon_i \text{ where } \epsilon_i \sim N(0, \sigma^2)\]
\[\underbrace{Y_i}_\text{weight} = \beta_0 + \beta_1 \underbrace{X_{1i}}_\text{height} + \beta_2 \underbrace{X_{2i}}_\text{age} + \epsilon_i \text{ where } \epsilon_i \sim N(0, \sigma^2)\]
Q: Suppose a researcher from the 1920’s was studying how far it took vehicles to stop (in feet) based on the speed they were traveling (in mph) when the brakes were applied.
Suppose further that they came up with the following regression model from their study.
\[\hat{Y_i} = -17 + 4.2X_i\]
Use the “cars” data set in R to validate this regression model. What is the value of the validation R-squared for this model? (Hint: Yhat <- -17…)
A: 0.6141207
Yhat <- -17 + 4.2*cars$speed
Ybar <- mean(cars$dist)
SSE <- sum( (Yhat - cars$dist)^2 )
SSTO <- sum( (cars$dist - Ybar)^2 )
1 - SSE/SSTO
## [1] 0.6141207
Q: Four degree 1 lowess curves are shown below, each time on the same cars data.
Which lowess curve uses 20% of the neighboring dots at each local regression to create the curve?
You’ll need to draw it yourself to really know.
A:
plot(dist ~ speed, data=cars, main="car data set")
lines(lowess(cars$speed, cars$dist, f=.2))
Q: A logistic regression is performed to model the probability that a student will get an A in Math 325 based on their Analysis score. The following output is obtained.
Which of the following is a correct interpretation of this output?
A: The odds of a student getting an A in Math 325 triple for every extra point a student gains in their Analysis Total score.
exp(1.1423)
## [1] 3.133968
Q: A regression is performed, and the following output obtained. What conclusion can we make from this output?
A: This regression model significantly predicts the mean Y-value, but does not do the best job at predicting the individual Y values.
Further Explanation: The p-value tells us if the X-variable can significantly predict the average y-value. Since the p-value is significant (3.41e-08), X is useful for predicting the average y-value. However, since the R-squared value is 0.5655, the model does not really offer very much insight on the individual y-values because they vary so much from the line. When the R-squared value is close to 1, then the regression model gives great insight about both the mean, and the individuals.
Incorrect Conclusions:
Q: Suppose a student fits the model Y ~ X2 after looking at the following pairs plot.
They then compute the residuals (here called R) from their regression and make this second pairs plot.
What should they do next?
A: They should add a quadratic term for X2 to their model.
Further Explanation: The only “added variable plot” that has information in it is the one of R compared to X2. That plot shows a cupping shape, implying that a higher order term for X2 is needed. Since the shape is parabolic, it would be wisest to add a quadratic term to the current model, to obtain Y ~ X2 + I(X2^2).
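A minimal sketch of fitting that quadratic model, on simulated data (the names Y and X2 come from the question; the data here are made up for illustration):

```r
set.seed(42)
X2 <- runif(100, -3, 3)
Y  <- 5 + 2*X2 + 4*X2^2 + rnorm(100)   # truth really is quadratic in X2

quad.lm <- lm(Y ~ X2 + I(X2^2))   # I() protects the ^2 inside the formula
coef(quad.lm)

# The residuals vs fitted plot should no longer show the cupping shape:
plot(quad.lm, which = 1)
```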
Incorrect Next Steps:
Q: Use the mtcars data set in R to plot the robust regression line for hp ~ wt.
Identify the robust regression line in the plot below.
A: blue line
library(MASS)
plot(hp ~ wt, data=mtcars)
abline(rlm(hp ~ wt, data=mtcars), col="dodgerblue", lwd=2) #correct answer
abline(lm(hp ~ wt, data=mtcars), col="green", lwd=2)
abline(-.5, 35, col="orange", lwd=2)
abline(-2, 50, col="firebrick", lwd=2)
mtext(side=3, text="mtcars data set")
Q: Which logistic curve represents the males in the following plot?
A: The male line is the orange line, as men were less likely to have survived the Titanic than women.
Q: Based on the information shown below, what is the predicted cost of a Toyota Corolla with 40,000 miles on it?
A: $14,510
To get the answer you must use the coefficients in the summary output below the graph.
Then the equation becomes:
\[ \hat{Y}_i = \underbrace{19408}_\text{intercept} \underbrace{- \ 0.1926(40000)}_\text{mileage} \underbrace{- \ 2002}_\text{corolla} + \underbrace{0.1202(40000)}_\text{corolla:mileage} = 14510 \]
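The same computation can be done in R with the coefficients from the summary output:

```r
intercept <- 19408     # baseline intercept
mileage   <- -0.1926   # slope on mileage (baseline line)
corolla   <- -2002     # shift in intercept for Corollas
interact  <- 0.1202    # corolla:mileage interaction (shift in slope)

intercept + mileage*40000 + corolla + interact*40000
## [1] 14510
```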
Q: Which of the following is a “best” statement to make when interpreting a regression slope?
A: The change in the average y-value per unit change in x.
Further Explanation: The regression line models the average y-value for any given x-value. Since the slope is the change in the line as x changes by 1 unit, then the slope should be interpreted as the change in the average y-value (the line) for each unit change in x.
Incorrect statements:
Q: Which of the following graphs shows a situation where it would be possible for the following numbers to occur?
SSE = 1.698519
SSR = 2730.497
SSTO = 2738.636
A:
Q: Use the cars data set in R to test the following hypotheses.
\[\underbrace{Y_i}_\text{dist} = \beta_0 + \beta_1 \underbrace{X_i}_\text{speed} + \epsilon_i \text{ where } \epsilon_i \sim N(0, \sigma^2)\]
\[H_0: \beta_1 = 4.2 \\ H_a: \beta_1 \neq 4.2\]
Report the p-value of your test.
A: 0.52
summary(lm(dist ~ speed, data=cars))
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
\[t = (3.9324 - 4.2)/0.4155 = -0.6440433 \\ \text{p-value} = 2 \cdot pt(-0.6440433, 48) = 0.5226131\]
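The same test carried out in R, pulling the estimate and standard error from the summary output:

```r
cars.lm <- lm(dist ~ speed, data = cars)
est <- summary(cars.lm)$coefficients["speed", "Estimate"]
se  <- summary(cars.lm)$coefficients["speed", "Std. Error"]

t <- (est - 4.2) / se
2 * pt(-abs(t), 48)   # two-sided p-value, 48 residual degrees of freedom
```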
Q: Estimate the coefficients in the following regression model using the airquality data set in R.
\[\underbrace{Y_i}_\text{Temp} = \beta_0 + \beta_1 \underbrace{X_i}_\text{Month} + \beta_2 X_{i}^2 + \epsilon_i \text{ where } \epsilon_i \sim N(0, \sigma^2)\]
A: \(b_0 = -95.73, b_1 = 48.72, b_2 = -3.28\)
coef(lm(Temp ~ Month + I(Month^2), data=airquality))
## (Intercept) Month I(Month^2)
## -95.725568 48.718276 -3.282813
Q: Which of the following graphs would support the statement that “the odds that Y = 1 increases by 20% for each 1 unit change in x”?
A:
Further Explanation: An increase in the odds of 20% per unit change in X is fairly substantial. Thus, we would be looking for a graph that is (1) increasing in “slope” from left to right and (2) that does so fairly sharply. The shallower a curve, the smaller the percentage increase in the odds. The steeper the curve, the greater the percentage increase in the odds.
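To connect the “20% increase in odds” statement to a specific curve: the slope on the log-odds scale would be log(1.2), since exp(slope) is the odds ratio. A sketch (the intercept of -3 is an arbitrary choice for illustration):

```r
b1 <- log(1.2)   # exp(b1) = 1.2, i.e., a 20% increase in odds per unit of x
exp(b1)
## [1] 1.2

# Probability curve implied by this slope:
curve(exp(-3 + b1*x) / (1 + exp(-3 + b1*x)), from = 0, to = 40,
      xlab = "x", ylab = "P(Y = 1)")
```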
Incorrect graphics:
Q: Suppose you are asked to help decide how wide (in centimeters) kids' shoes should be for kids with foot lengths of 25 centimeters. Suppose further you are provided with the KidsFeet dataset found in library(mosaic). Perform a regression that would help provide an answer to this question.
Select the interval below that, according to your regression, should contain 95% of the widths of kids' feet that are 25 centimeters long.
A: We predict kids' feet to be anywhere from 8.25 cm to 9.87 cm wide when their foot length is 25 cm.
plot(width ~ length, data=KidsFeet)
predict(lm(width ~ length, data=KidsFeet), data.frame(length=25), interval="prediction")
## fit lwr upr
## 1 9.06097 8.247227 9.874713
Q: What is the value of the residual for the orange dot in the plot below?
A: 42.52537
The easiest way to locate this value is with the residual plot, which identifies the three largest residuals automatically.
cars.lm <- lm(dist ~ speed, data=cars)
plot(cars.lm, which=1)
From that plot, you should notice that point 23 is in the same position as the orange dot in the plot of this question. Thus,
cars.lm$residuals[23]
## 23
## 42.52537
Q: Which regression from the options below would best fit the data for the scatterplot shown?
A: A lowess curve.
X <- runif(30,0,50)
Y <- 20000 - 800*X + 25*X^2 - 0.03*X^3 + 0.005*X^4 - 0.0002*X^5 + rnorm(30,0,1000)
plot(Y ~ X)
Q: Return to the RailTrail data set in R.
Perform an appropriate analysis that would allow you to predict the average volume of users on the trail by knowing the cloud cover measurement for a given day.
Draw a graphic that includes the results of your analysis in the plot.
Use your graphic to estimate the average volume for days with a cloud cover measurement of 8.
A: Around 343 users on average.
Run this code:
plot(volume ~ cloudcover, data=RailTrail)
abline(lm(volume ~ cloudcover, data=RailTrail))
# Then look at the plot above an x-value of 8.
abline(v=8, lty=2) #if this helps your eyes, then use it.
# That corresponds to a y-value of about 343
abline(h=343, lty=2)
To find the exact value, do some math:
summary(lm(volume ~ cloudcover, data=RailTrail))
##
## Call:
## lm(formula = volume ~ cloudcover, data = RailTrail)
##
## Residuals:
## Min 1Q Median 3Q Max
## -217.61 -80.66 -3.98 67.71 348.66
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 461.322 25.904 17.81 < 2e-16 ***
## cloudcover -14.797 3.905 -3.79 0.000276 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 118.9 on 88 degrees of freedom
## Multiple R-squared: 0.1403, Adjusted R-squared: 0.1305
## F-statistic: 14.36 on 1 and 88 DF, p-value: 0.0002757
461.322 - 14.797*8
## [1] 342.946
Q: The following plots show the same five points in each plot but a different line in each case. Which plot shows a regression line with the smallest SSE, i.e., the least squares regression line? - You shouldn’t be able to tell by just looking at it. Do some calculations.
A: Computing the SSE for each graph identifies the graph with:
sum(c(-7.1,13,-7.4,4.1,-2.7)^2) # = 298.27
as the line of least squares.
Q: View the Marriage data in R.
This data set lists characteristics of 98 individuals (49 couples) at the time they were married. The age column lists the age of each individual. The prevcount column states how many times the individual was previously married prior to this marriage. So a prevcount of 2 means that person was previously married twice.
Perform an analysis using this data that looks to see if the age of the individual at the time of marriage can predict if that person was previously married.
Then, suppose that Frank is getting married and is 27 years old. Based on your analysis, what is the probability that Frank was previously married?
A: There is about a 32.3% chance that Frank was previously married.
myglm <- glm(prevcount > 0 ~ age, data=Marriage, family=binomial)
predict(myglm, data.frame(age=27), type="response")
## 1
## 0.3232086
Q: Run the following code in R.
plot(time ~ age, data=TenMileRace, col=sex)
Perform an appropriate analysis that would allow you to add two linear regression lines with the same slope to this plot. One line should be for the male’s data, the other for the female’s data. The common slope of the two lines should be 16.493.
Based on the results of your analysis, by how much do the average times of males and females differ?
A: By about 775.908 seconds (see the sexM coefficient).
mylm <- lm(time ~ age + sex, data=TenMileRace)
summary(mylm)
##
## Call:
## lm(formula = time ~ age + sex, data = TenMileRace)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3032.4 -637.5 -28.8 596.5 4663.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5592.610 37.604 148.72 <2e-16 ***
## age 16.493 1.013 16.28 <2e-16 ***
## sexM -775.908 21.478 -36.13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 965.3 on 8633 degrees of freedom
## Multiple R-squared: 0.136, Adjusted R-squared: 0.1358
## F-statistic: 679.5 on 2 and 8633 DF, p-value: < 2.2e-16
Q: Use the airquality data set in R to perform a multiple regression of the following model.
\[\underbrace{Y_i}_\text{Ozone} = \beta_0 + \beta_1 \underbrace{X_{1i}}_\text{Wind} + \beta_2 \underbrace{X_{2i}}_\text{Temp} + \beta_3 \underbrace{X_{1i}}_\text{Wind} \underbrace{X_{2i}}_\text{Temp} + \epsilon_i\]
Based on your results in R, which of the regression lines shown on the plot below represents the line for a Temp of 70 degrees?
A: A
mylm <- lm(Ozone ~ Wind + Temp + Wind:Temp, data=airquality)
b <- coef(mylm)
plot(Ozone ~ Wind, data=airquality, cex=(Temp/max(Temp))^3+.2, xlim=c(0,22))
curve(b[1] + b[2]*x + b[3]*90 + b[4]*x*90, add=TRUE)
curve(b[1] + b[2]*x + b[3]*80 + b[4]*x*80, add=TRUE)
curve(b[1] + b[2]*x + b[3]*70 + b[4]*x*70, add=TRUE, col="blue")
curve(b[1] + b[2]*x + b[3]*100 + b[4]*x*100, add=TRUE)
text(1,29,"A")
text(1,65,"B")
text(1,105,"C")
text(1,140,"D")
points(c(18,18),c(140,160), cex=c((56/97)^3+.2,1.2))
text(x=c(18,18),y=c(140,160), pos=4, c("56 deg","97 deg"))
Q: Select the residuals vs fitted values plot below that shows the greatest evidence of a non-linear pattern in the data.
A: The correct choice is the plot whose red line bends systematically and is well supported by the data; a strong, data-backed curve in the red line is clear evidence of a non-linear pattern. (We are looking for a red line that curves systematically and is supported by the data.)
A common distractor shows an even more dramatic red line, but it is the least trustworthy choice because of its small sample size and the erratic behavior of the line.
Q: Based on the pairs plot shown below, which model would be the “best” option for this data?
A: Y ~ X1 + X2 + I(X2^2)
So the best choice is the option that includes X1 together with both X2 and its square, X2^2.
Q: Use the mtcars data set in R to perform an appropriate transformation regression of mpg against hp.
Use your transformation regression to predict the gas mileage (mpg) of a vehicle that has 350 horse power (hp).
A: 9.587017
lm1 <- lm(mpg ~ hp, data=mtcars)
boxCox(lm1)
#Suggests a 0 value is best, or log(Y) transformation.
lm2 <- lm(log(mpg) ~ hp, data=mtcars)
exp(predict(lm2, data.frame(hp=350))) #prediction transformed back
## 1
## 9.587017
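Putting the untransformed and log-transformed models side by side shows how much the transformation matters for this extrapolated prediction; a sketch using only the built-in mtcars data:

```r
lm_raw <- lm(mpg ~ hp, data = mtcars)          # untransformed model
lm_log <- lm(log(mpg) ~ hp, data = mtcars)     # Box-Cox suggested log(Y)

# The log-model prediction must be back-transformed with exp()
pred_raw <- predict(lm_raw, data.frame(hp = 350))
pred_log <- exp(predict(lm_log, data.frame(hp = 350)))  # about 9.59 mpg, as above
c(pred_raw, pred_log)
```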
We assume the model:
\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \text{ where } \epsilon_i \sim N(0, \sigma^2) \]
This means the relationship between \(X\) and \(Y\) is linear, with some random noise \(\epsilon_i\) that follows a normal distribution.
set.seed(123)
n <- 50
x <- runif(n, 0, 10)
beta0 <- 2
beta1 <- 3
sigma <- 2
epsilon <- rnorm(n, mean = 0, sd = sigma)
y <- beta0 + beta1 * x + epsilon
model <- lm(y ~ x)
plot(x, y, main = "Simple Linear Regression", pch = 19)
abline(model, col = "blue", lwd = 2)
Problem 1
Open the airquality dataset in R. Perform a regression of daily average Wind speed (\(Y_i\)) using the daily maximum Temperature (\(X_i\)) as the explanatory variable.
Part (a)
Type out the mathematical equation for this regression model and label both \(Y\) and \(X\) in the equation.
\[ \underbrace{Y_i}_\text{Wind} = \beta_0 + \beta_1 \underbrace{X_i}_\text{Temp} + \epsilon_i \quad \text{where} \ \epsilon_i \sim N(0,\sigma^2) \]
Part (b)
Fit and summarize a simple linear regression model for this data.
# Type your code here
templm <- lm(Wind ~ Temp, data=airquality)
summary(templm)
##
## Call:
## lm(formula = Wind ~ Temp, data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5784 -2.4489 -0.2261 1.9853 9.7398
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.23369 2.11239 10.999 < 2e-16 ***
## Temp -0.17046 0.02693 -6.331 2.64e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.142 on 151 degrees of freedom
## Multiple R-squared: 0.2098, Adjusted R-squared: 0.2045
## F-statistic: 40.08 on 1 and 151 DF, p-value: 2.642e-09
Part (c)
Type out the estimated equation for this regression model using your estimates found in part (b).
\[ \hat{Y}_i = 23.234 + (-0.170)X_i \]
Part (d)
Plot the data and the estimated regression model.
# Type your code here
ggplot(airquality, aes(x=Temp, y=Wind)) +
geom_point(size=1.5, color="lightblue", alpha=0.5) +
geom_smooth(method="lm", formula=y~x, se=FALSE, linewidth=0.5, color="skyblue") +
theme_minimal()
Part (e)
Use your model to predict the average daily average Wind speed for an outside temperature of 72 degrees F.
\[ \hat{Y}_i = 23.234 + (-0.170)(72) \]
# Type your code here
23.234 + (-0.170)*(72)
## [1] 10.994
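The same prediction can be obtained with predict(), which uses the unrounded coefficients:

```r
templm <- lm(Wind ~ Temp, data = airquality)
predict(templm, data.frame(Temp = 72))  # about 10.96 mph
```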
Part (f)
Write out an interpretation of the slope and intercept of your model. Are both meaningful for this data?
Slope Interpretation: Each 1 degree F increase in the daily maximum Temperature (\(X_i\)) is associated with a 0.17 mph decrease in the daily average Wind speed (\(Y_i\)).
Intercept Interpretation: When the daily maximum Temperature is 0 degrees F, the daily average Wind speed is estimated to be 23.234 mph. The slope is meaningful; the intercept is an extrapolation far outside the range of this data and is not meaningful.
Problem 2
Open the mtcars dataset in R. Fit a regression model
to the data that can be used to explain average gas mileage of a
vehicle (mpg) (\(Y_i\))
using the weight (wt) (\(X_i\)) of the vehicle.
Part (a)
Type out the mathematical equation for this regression model and label both \(Y\) and \(X\) in the equation.
\[ \underbrace{Y_i}_\text{mpg} = \beta_0 + \beta_1 \underbrace{X_i}_\text{wt} + \epsilon_i \quad \text{where} \ \epsilon_i \sim N(0,\sigma^2) \]
Part (b)
Fit and summarize a simple linear regression model for this data.
carslm <- lm(mpg ~ wt, data=mtcars)
summary(carslm)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
Part (c)
Type out the estimated equation for this regression model using your estimates found in part (b).
\[ \hat{Y}_i = 37.29 + (-5.34)X_i \]
Part (d)
Plot the data and the estimated regression model.
# Type your code here
ggplot(mtcars, aes(x=wt, y=mpg)) +
geom_point(size=1.5, color="lightblue", alpha=0.5) +
geom_smooth(method="lm", formula=y~x, se=FALSE, linewidth=0.5, color="skyblue") +
theme_minimal()
Part (e)
Use your model to predict the average gas mileage (mpg) for a vehicle that weighs 3,000 lbs. (Hint: ?mtcars)
**Note:** In mtcars, wt is recorded in units of 1,000 lbs, so 3,000 lbs converts to wt = 3.
\[ \hat{Y}_i = 37.29 + (-5.34)(3) \] ANSWER:
# Type your code here
37.29 + (-5.34)*(3)
## [1] 21.27
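As a check, predict() gives the same answer without rounding the coefficients by hand:

```r
carslm <- lm(mpg ~ wt, data = mtcars)
predict(carslm, data.frame(wt = 3))  # about 21.25 mpg
```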
Part (f)
Write out an interpretation of the slope and intercept of your model. Are both meaningful for this data?
Slope Interpretation: Each additional 1,000 lbs of vehicle weight (\(X_i\)) is associated with a 5.34 mpg decrease in the average gas mileage of such vehicles (\(Y_i\)).
Intercept Interpretation: The average gas mileage of vehicles that weigh nothing (0 lbs) is estimated to be 37.29 mpg. The slope is meaningful; the intercept has no physical meaning for this data.
Before we can really trust the interpretation of and predictions from a regression model, there are important diagnostic checks to perform on the regression. These diagnostics are even more important when p-values or confidence intervals are used as part of the analysis. In future weeks of this course, we will focus in greater detail on the technical details of regression: hypothesis tests, confidence intervals, and diagnostic checks. However, for the sake of completeness, the following problems run through these technical details, even though we currently lack a full understanding of them.
Problem 3
Use your regression for the airquality data set in Problem 1 to complete the following “technical details” for this regression.
Part (a)
Create a (1) residuals vs. fitted-values plot, (2) Q-Q Plot of the residuals, and (3) residuals vs. order plot.
# Type your code here
par(mfrow=c(1,3))
plot(templm, which=1)
qqPlot(templm$residuals, main="Q-Q Plot",pch= 19, id=FALSE)
plot(templm$residuals, ylab= "Residuals", main="Residuals vs Order")
Part (b)
Explain, as best you understand currently, what each of these three plots show for this regression.
Explanation: Everything looks pretty good. The residuals vs. fitted-values plot shows constant variance and a nice linear relation. The Q-Q Plot shows possible problems with normality because some dots go out of bounds, but is fairly good. The residuals vs. order plot shows no problems with time trends, so the error terms can be assumed to be independent.
Part (c)
Report the p-value for the test of these hypotheses for your regression.
Intercept Hypotheses
\[ H_0: \beta_0 = 0 \quad \text{vs.} \quad H_a: \beta_0 \neq 0 \]
Slope Hypotheses
\[ H_0: \beta_1 = 0 \quad \text{vs.} \quad H_a: \beta_1 \neq 0 \]
Comment on whether or not we should trust each p-value based on your plots in Part (a).
Should trust each p-value based on your plots in Part (a)?
Intercept Hypotheses p-value? : p < 2e-16. Yes, it can be trusted because the diagnostic plots all checked out.
Slope Hypotheses p-value? : p = 2.64e-09. Yes, it can be trusted because the diagnostic plots all checked out.
Problem 4
Use your regression for the mtcars data set in Problem 2 to complete the following “technical details” for this regression.
Part (a)
Create a (1) residuals vs. fitted-values plot, (2) Q-Q Plot of the residuals, and (3) residuals vs. order plot.
# Type your code here
par(mfrow=c(1,3))
plot(carslm, which=1)
qqPlot(carslm$residuals, main="Q-Q Plot",pch= 19, id=FALSE)
plot(carslm$residuals, ylab= "Residuals", main="Residuals vs Order")
Part (b)
Explain, as best you understand currently, what each of these three plots show for this regression.
Explanation: Everything looks somewhat questionable. The residuals vs. fitted-values plot shows a lack of linearity, which makes it hard to judge constant variance. The Q-Q Plot shows possible problems with normality because some dots go out of bounds, but is fairly good. The residuals vs. order plot shows a possible problem with time trends due to the slight rainbow pattern.
Part (c)
Report the p-value for the test of these hypotheses for your regression.
Intercept Hypotheses
\[ H_0: \beta_0 = 0 \quad \text{vs.} \quad H_a: \beta_0 \neq 0 \]
Slope Hypotheses
\[ H_0: \beta_1 = 0 \quad \text{vs.} \quad H_a: \beta_1 \neq 0 \]
Comment on whether or not we should trust each p-value based on your plots in Part (a).
Should trust each p-value based on your plots in Part (a)?
Intercept Hypotheses p-value? : p = 8.24e-19, but it should not be trusted because of the lack of linearity and because zero is far from the range of the current data.
Slope Hypotheses p-value? : p = 1.29e-10, but it should not be fully trusted because of some problems in the diagnostic plots, though there is likely a real trend in the data.
Run the following commands in R.
library(car)
View(Davis)
?Davis
Reduce the data to just the data for the males. Then perform a regression with weight as the response variable and height as the explanatory variable.
Which of the following provides the estimated average weight of males that are 180 cm tall?
Davey <- filter(Davis, sex == "M")
daveylm <- lm(weight ~ height, data=Davey)
summary(daveylm)
##
## Call:
## lm(formula = weight ~ height, data = Davey)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.869 -6.893 -0.897 6.114 41.122
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -101.3301 29.8617 -3.393 0.00105 **
## height 0.9956 0.1676 5.939 5.92e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.07 on 86 degrees of freedom
## Multiple R-squared: 0.2908, Adjusted R-squared: 0.2826
## F-statistic: 35.27 on 1 and 86 DF, p-value: 5.922e-08
-101.33 +(0.9956)*180
## [1] 77.878
View(USArrests)
?USArrests
Perform a regression using this data that explains the average number of murder arrests in cities (per 100,000 in 1973) using the number of assault arrests (per 100,000) in a city.
Select the answer that provides the correct estimate of \(\beta_1\) in the formula:
\(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \text{ where } \epsilon_i \sim N(0, \sigma^2)\)
for the USArrests regression described above.
arrestlm <- lm(Murder ~ Assault, data=USArrests)
summary(arrestlm)
##
## Call:
## lm(formula = Murder ~ Assault, data = USArrests)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8528 -1.7456 -0.3979 1.3044 7.9256
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.631683 0.854776 0.739 0.464
## Assault 0.041909 0.004507 9.298 2.6e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.629 on 48 degrees of freedom
## Multiple R-squared: 0.643, Adjusted R-squared: 0.6356
## F-statistic: 86.45 on 1 and 48 DF, p-value: 2.596e-12
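The estimate of \(\beta_1\) can be pulled directly from the fitted model rather than read off the summary:

```r
arrestlm <- lm(Murder ~ Assault, data = USArrests)
coef(arrestlm)["Assault"]  # about 0.0419 murder arrests per assault arrest (each per 100,000)
```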
Note: the “Line of Equality” shows where the line would need to be if the reported heights were equal (on average) to the actual measured heights. Dots that fall on this line show men that knew their actual height before they were officially measured.
Answer: The average actual height gets closer to the reported height for taller men, while shorter men seem more likely to under-report their height, on average.
A residual is the difference between the actual and predicted
value:
\[ r_i = Y_i - \hat{Y}_i \]
residuals <- resid(model)
SSTO <- sum((y - mean(y))^2)
SSE <- sum(residuals^2)
SSR <- SSTO - SSE
R2 <- SSR / SSTO
R2 # R-squared
## [1] 0.9594401
summary(model)$r.squared
## [1] 0.9594401
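For simple linear regression, \(R^2\) also equals the squared correlation between \(X\) and \(Y\), which gives a quick cross-check on the sums-of-squares computation. This sketch recreates the simulated data from above with the same seed:

```r
set.seed(123)
n <- 50
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n, mean = 0, sd = 2)
model <- lm(y ~ x)

summary(model)$r.squared  # 0.9594401, matching the output above
cor(x, y)^2               # the same value
```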
datatable(Orange)
orangelm <- lm(circumference ~ age, data=Orange)
summary(orangelm)
##
## Call:
## lm(formula = circumference ~ age, data = Orange)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.310 -14.946 -0.076 19.697 45.111
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.399650 8.622660 2.018 0.0518 .
## age 0.106770 0.008277 12.900 1.93e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.74 on 33 degrees of freedom
## Multiple R-squared: 0.8345, Adjusted R-squared: 0.8295
## F-statistic: 166.4 on 1 and 33 DF, p-value: 1.931e-14
\[\hat{Y}_i = 17.40 + 0.1068 X_i\]
ggplot(Orange, aes(x = age, y = circumference))+
geom_point(color = "orange") +
geom_smooth(method = "lm", formula = y~x, se = FALSE, color = "chocolate")+
theme_minimal()
SSE <- round(sum(residuals(orangelm)^2), 2)
SSR <- round(sum((fitted(orangelm)- mean(Orange$circumference))^2), 2)
SSTO <- round(sum((Orange$circumference - mean(Orange$circumference))^2), 2)
R2 <- round(SSR / SSTO, 2)
correlation <- sqrt(R2)
if(coef(orangelm)[2]<0) correlation <- -correlation
correlation <- round(correlation, 2)
cat("SSE:", SSE, "\n")
## SSE: 18594.74
cat("SSR:", SSR, "\n")
## SSR: 93771.54
cat("SSTO:", SSTO, "\n")
## SSTO: 112366.3
cat("R-squared (R2):", R2, "\n")
## R-squared (R2): 0.83
cat("Correlation (r):", correlation, "\n")
## Correlation (r): 0.91
predict(orangelm, data.frame(age=1095)) # note: an interval= argument belongs in predict() itself, not inside data.frame()
## 1
## 134.3132
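To get a prediction interval along with the point estimate, interval = "prediction" must be passed to predict() itself; placing it inside data.frame() is silently ignored and returns only the fitted value:

```r
orangelm <- lm(circumference ~ age, data = Orange)
predict(orangelm, data.frame(age = 1095), interval = "prediction")
# fit is about 134.3; lwr and upr bound the circumference of a single new tree
```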
datatable(mtcars)
carwtlm <- lm(mpg ~ wt, data=mtcars)
carcyllm <- lm(mpg ~ cyl, data= mtcars)
carhplm <- lm(mpg ~ hp, data=mtcars)
summary(carwtlm) %>%
pander()
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 37.29 | 1.878 | 19.86 | 8.242e-19 |
| wt | -5.344 | 0.5591 | -9.559 | 1.294e-10 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 32 | 3.046 | 0.7528 | 0.7446 |
summary(carcyllm) %>%
pander()
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 37.88 | 2.074 | 18.27 | 8.369e-18 |
| cyl | -2.876 | 0.3224 | -8.92 | 6.113e-10 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 32 | 3.206 | 0.7262 | 0.7171 |
summary(carhplm) %>%
pander()
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 30.1 | 1.634 | 18.42 | 6.643e-18 |
| hp | -0.06823 | 0.01012 | -6.742 | 1.788e-07 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 32 | 3.863 | 0.6024 | 0.5892 |
ggplot(mtcars, aes(x=wt, y = mpg)) +
geom_point() +
geom_smooth(method = "lm", formula = y~x, se = FALSE, color = "red")+
labs(title = "Explanatory variable is wt")
ggplot(mtcars, aes(x=cyl, y = mpg)) +
geom_point() +
geom_smooth(method = "lm", formula = y~x, se = FALSE, color = "blue")+
labs(title = "Explanatory variable is cyl")
ggplot(mtcars, aes(x=hp, y = mpg)) +
geom_point() +
geom_smooth(method = "lm", formula = y~x, se = FALSE, color = "green")+
labs(title = "Explanatory variable is hp")
# Just change the lms
SSE.wt <- round(sum(residuals(carwtlm)^2), 2)
# the response variable (mtcars$mpg here) goes in the mean() parts
SSR.wt <- round(sum((fitted(carwtlm)- mean(mtcars$mpg))^2), 2)
SSTO.wt <- round(sum((mtcars$mpg - mean(mtcars$mpg))^2), 2)
R2.wt <- round(SSR.wt / SSTO.wt, 2)
correlation.wt <- sqrt(R2.wt)
if(coef(carwtlm)[2]<0) correlation.wt <- -correlation.wt
correlation.wt <- round(correlation.wt, 2)
cat("SSE:", SSE.wt, "\n")
## SSE: 278.32
cat("SSR:", SSR.wt, "\n")
## SSR: 847.73
cat("SSTO:", SSTO.wt, "\n")
## SSTO: 1126.05
cat("R-squared (R2):", R2.wt, "\n")
## R-squared (R2): 0.75
cat("Correlation (r):", correlation.wt, "\n")
## Correlation (r): -0.87
SSE.cyl <- round(sum(residuals(carcyllm)^2), 2)
# the response variable (mtcars$mpg here) goes in the mean() parts
SSR.cyl <- round(sum((fitted(carcyllm)- mean(mtcars$mpg))^2), 2)
SSTO.cyl <- round(sum((mtcars$mpg - mean(mtcars$mpg))^2), 2)
R2.cyl <- round(SSR.cyl / SSTO.cyl, 2)
correlation.cyl <- sqrt(R2.cyl)
if(coef(carcyllm)[2]<0) correlation.cyl <- -correlation.cyl
correlation.cyl <- round(correlation.cyl, 2)
cat("SSE:", SSE.cyl, "\n")
## SSE: 308.33
cat("SSR:", SSR.cyl, "\n")
## SSR: 817.71
cat("SSTO:", SSTO.cyl, "\n")
## SSTO: 1126.05
cat("R-squared (R2):", R2.cyl, "\n")
## R-squared (R2): 0.73
cat("Correlation (r):", correlation.cyl, "\n")
## Correlation (r): -0.85
SSE.hp <- round(sum(residuals(carhplm)^2), 2)
# the response variable (mtcars$mpg here) goes in the mean() parts
SSR.hp <- round(sum((fitted(carhplm)- mean(mtcars$mpg))^2), 2)
SSTO.hp <- round(sum((mtcars$mpg - mean(mtcars$mpg))^2), 2)
R2.hp <- round(SSR.hp / SSTO.hp, 2)
correlation.hp <- sqrt(R2.hp)
if(coef(carhplm)[2]<0) correlation.hp <- -correlation.hp
correlation.hp <- round(correlation.hp, 2)
cat("SSE:", SSE.hp, "\n")
## SSE: 447.67
cat("SSR:", SSR.hp, "\n")
## SSR: 678.37
cat("SSTO:", SSTO.hp, "\n")
## SSTO: 1126.05
cat("R-squared (R2):", R2.hp, "\n")
## R-squared (R2): 0.6
cat("Correlation (r):", correlation.hp, "\n")
## Correlation (r): -0.77
All together:
# Combine the statistics into a data frame
stats_table <- data.frame(
Metric = c("SSE", "SSR", "SSTO", "R_squared", "Correlation"),
wt = c(SSE.wt, SSR.wt, SSTO.wt, R2.wt, correlation.wt),
cyl = c(SSE.cyl, SSR.cyl, SSTO.cyl, R2.cyl, correlation.cyl),
hp = c(SSE.hp, SSR.hp, SSTO.hp, R2.hp, correlation.hp)
)
# Print the table
print(stats_table)
## Metric wt cyl hp
## 1 SSE 278.32 308.33 447.67
## 2 SSR 847.73 817.71 678.37
## 3 SSTO 1126.05 1126.05 1126.05
## 4 R_squared 0.75 0.73 0.60
## 5 Correlation -0.87 -0.85 -0.77
# Optional: Format the table with nice rounding for display
kable(stats_table, caption = "Summary Statistics for mtcars dataset Regression Models")
| Metric | wt | cyl | hp |
|---|---|---|---|
| SSE | 278.32 | 308.33 | 447.67 |
| SSR | 847.73 | 817.71 | 678.37 |
| SSTO | 1126.05 | 1126.05 | 1126.05 |
| R_squared | 0.75 | 0.73 | 0.60 |
| Correlation | -0.87 | -0.85 | -0.77 |
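The same summary table can be built with less copy-and-paste by looping over the three fitted models; a sketch whose numbers match the table above:

```r
models <- list(wt  = lm(mpg ~ wt,  data = mtcars),
               cyl = lm(mpg ~ cyl, data = mtcars),
               hp  = lm(mpg ~ hp,  data = mtcars))

data.frame(
  SSE = sapply(models, function(m) sum(resid(m)^2)),
  R2  = sapply(models, function(m) summary(m)$r.squared),
  # correlation carries the sign of the slope
  r   = sapply(models, function(m) sign(coef(m)[2]) * sqrt(summary(m)$r.squared))
)
```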
par(mfrow=c(1,3))
plot(carwtlm, which=1)
qqPlot(carwtlm, id=FALSE, main= "Q-Q plot", col="darkred", col.lines = "red", pch = 16)
plot(carwtlm$residuals, main="Residuals vs Order")
par(mfrow=c(1,3))
plot(carcyllm, which=1)
qqPlot(carcyllm, id=FALSE, main= "Q-Q plot", col="darkblue", col.lines = "lightblue", pch = 16)
plot(carcyllm$residuals, main="Residuals vs Order")
par(mfrow=c(1,3))
plot(carhplm, which=1)
qqPlot(carhplm, id=FALSE, main= "Q-Q plot", col="darkgreen", col.lines = "green", pch = 16)
plot(carhplm$residuals, main="Residuals vs Order")
The observed y-values for the regression are: 3.78, 6.08, 6.65, 9.25, and 9.92.
The residuals from the regression are: -0.266, 0.489, -0.486, 0.569, and -0.306.
What is the R-squared value for this regression?
y <- c(3.78, 6.08, 6.65, 9.25, 9.92)
SSTO <- sum( (y - mean(y))^2 )
#SSTO = 24.83372
res <- c(-0.266, 0.489, -0.486, 0.569, -0.306)
SSE <- sum(res^2)
#SSE = 0.96347
rr <- 1 - SSE/SSTO
print(rr)
## [1] 0.9612032
This data can be used to show that the displacement of the engine (disp) is positively correlated with the weight of the vehicle (wt).
What is the R-squared value of this regression?
mt.lm <- lm(disp ~ wt, data=mtcars)
summary(mt.lm)
##
## Call:
## lm(formula = disp ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88.18 -33.62 -10.05 35.15 125.59
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -131.15 35.72 -3.672 0.000933 ***
## wt 112.48 10.64 10.576 1.22e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 57.94 on 30 degrees of freedom
## Multiple R-squared: 0.7885, Adjusted R-squared: 0.7815
## F-statistic: 111.8 on 1 and 30 DF, p-value: 1.222e-11
Perform a regression of mpg on weight of the vehicle (mpg ~ wt).
A certain statistics teacher at BYU-Idaho drives a 2001 Nissan Sentra that weighs approximately 2,700 lbs and currently gets only 21 mpg. Based on the regression you performed, how many mpg above or below average is this vehicle?
mt.lm <- lm(mpg ~ wt, data=mtcars) #perform the regression
plot(mpg ~ wt, data=mtcars) #draw the regression (not needed, but nice)
abline(mt.lm) #add the regression line (not needed, but nice)
points(2.7, 21, pch=16, col="skyblue") #add the Nissan Sentra value (not needed, but nice)
predict(mt.lm, data.frame(wt=2.7)) #get predicted value for Nissan Sentra
## 1
## 22.85505
lines(c(2.7, 2.7), c(21, 22.85505), col="skyblue") #add residual line (not needed, but nice)
myresidual <- 21 - 22.85505 #calculate difference between Y and Y-hat
print(myresidual)
## [1] -1.85505
par(mfrow = c(2, 2))
plot(model)
Sometimes transforming \(Y\) helps meet assumptions:
# Combine x and y into a clean data frame
data_clean <- data.frame(x = x, y = y)
data_clean <- na.omit(data_clean) # Drop rows with NA
# Fit the model
model <- lm(y ~ x, data = data_clean)
# Box-Cox transformation
car::boxCox(model)
# Continue with log transformation
y_log <- log(data_clean$y)
model_log <- lm(y_log ~ data_clean$x)
# Plotting
plot(data_clean$x, y_log, main = "Log-Transformed Regression", pch = 19)
abline(model_log, col = "red", lwd = 2)
Problem 1:
Open the Davis dataset in R, found in library(car). As stated in the help file for this data set, “The subjects were men and women engaged in regular exercise.”
Perform a simple linear regression of the height of the individual based on their weight.
PART (A)
Type out the mathematical equation for this regression model and label both \(Y\) and \(X\) in the equation.
\[ \underbrace{Y_i}_\text{height} = \beta_0 + \beta_1 \underbrace{X_i}_\text{weight} + \epsilon_i \quad \text{where} \ \epsilon_i \sim N(0, \sigma^2) \]
PART (B)
Plot a scatterplot of the data with your regression line overlaid.
davelm <- lm(height ~ weight, data= Davis)
plot(height ~ weight, data= Davis)
abline(davelm)
PART (C)
Create a residuals vs fitted-values plot for this regression. What does this plot show?
par(mfrow=c(1,3))
plot(davelm, which=1)
What does the plot SHOW? - The point in the bottom right corner of the plot, labeled “12” (observation 12), appears to be a dramatic outlier that pulls the regression line towards it.
PART (D)
State and interpret the slope, y-intercept, and \(R^2\) of this regression. Are they meaningful for this data under the current regression?
summary(davelm)
##
## Call:
## lm(formula = height ~ weight, data = Davis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -128.137 -4.758 0.280 6.088 25.441
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 160.09312 3.74677 42.728 < 2e-16 ***
## weight 0.15086 0.05551 2.718 0.00715 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.82 on 198 degrees of freedom
## Multiple R-squared: 0.03597, Adjusted R-squared: 0.0311
## F-statistic: 7.387 on 1 and 198 DF, p-value: 0.007152
| Part | Meaning |
|---|---|
| Slope = 0.15 | The slope is the increase in the average height (in cm) per one kilogram increase in weight. |
| Y-intercept = 160.09 | The y-intercept is the average height of exercising individuals who weigh zero kg (put nicely, nonsense). |
| \(R^2\) = 0.04 | The proportion of variation in height explained by this regression (with the outlier included) is essentially zero. |
Are these values meaningful for this data under the current regression?
PART (E)
Run View(Davis) in your Console. What do you notice about observation #12 in this data set?
Perform a second regression for this data with observation #12 removed. Recreate the scatterplot of Part (b) with two regression lines showing this time. The first regression line should include the outlier. The second should exclude the outlier. Include a legend to show which line is which.
daniel <- filter(Davis, weight != 166) # drops observation 12, whose recorded weight of 166 kg is an error
dangdaniel <- lm(height ~ weight, data = daniel)
ggplot(Davis, aes(x = weight, y = height)) +
geom_point(size = 1.5, shape = 19, color = "darkgrey", alpha = 1) +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE, aes(color = "Fitted Regression (with outlier)"), linewidth = 1.5) +
geom_smooth(data = daniel, aes(x = weight, y = height, color = "Fitted Regression (outlier removed)"), method = "lm", formula = y ~ x, se = FALSE, linewidth = 1.5) +
scale_color_manual(values = c("Fitted Regression (with outlier)" = "gray", "Fitted Regression (outlier removed)" = "skyblue")) +
labs(color = "Regression Lines") +
theme_classic() +
theme(legend.position = "inside", legend.position.inside = c(0.11, 0.13)) +
labs(title = "Exercising Individuals (Davis data set)", x = "Measured Weight of Individual in kg (weight)", y = "Measured Height of Individual in cm (height)")
PART (F)
Compute the slope, y-intercept, and \(R^2\) value for the regression with the outlier removed. Compare the results to the values when the outlier was present.
summary(dangdaniel)
##
## Call:
## lm(formula = height ~ weight, data = daniel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.347 -3.727 -0.232 3.604 20.880
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 136.83661 2.02882 67.45 <2e-16 ***
## weight 0.51689 0.03044 16.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.716 on 197 degrees of freedom
## Multiple R-squared: 0.594, Adjusted R-squared: 0.592
## F-statistic: 288.3 on 1 and 197 DF, p-value: < 2.2e-16
WITHOUT outlier: slope = 0.517, y-intercept = 136.84, \(R^2\) = 0.594.
WITH outlier: slope = 0.151, y-intercept = 160.09, \(R^2\) = 0.036. Removing the single miscoded observation dramatically improves the fit.
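A less fragile way to drop the miscoded row is to locate it by its residual rather than by matching a weight value; a sketch assuming the Davis data from library(car):

```r
library(car)  # provides the Davis data set

davelm <- lm(height ~ weight, data = Davis)
out <- which.max(abs(resid(davelm)))   # observation 12, the swapped record
dave2 <- lm(height ~ weight, data = Davis[-out, ])

coef(dave2)  # slope rises from about 0.15 to about 0.52 once the outlier is gone
```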
PART (G)
Create a residuals vs fitted-values plot for the regression with the outlier removed. How do things look now?
par(mfrow=c(1,3))
plot(dangdaniel, which=1)
Problem 2:
Open the Prestige data set found in library(car).
Perform a regression that explains the 1971 average annual income from jobs according to their “Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960’s.”
PART (A)
Plot the data and fitted simple linear regression line.
presto <- lm(income ~ prestige, data= Prestige)
ggplot(Prestige, aes(x = prestige, y=income)) +
geom_point(size = 1.5, shape = 19, color = "green", alpha = 1) +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "grey")+
labs(title= "Greater Prestige linked to Greater Income (Prestige data set)", x= "Prestige Ranking of Occupation(prestige)", y="Average Annual Income USD (income)")+
theme_classic()
PART (B)
State the estimated values for \(\beta_0\), \(\beta_1\), and \(\sigma\) for this regression.
summary(presto)
##
## Call:
## lm(formula = income ~ prestige, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6693.3 -1434.9 136.5 1251.8 15152.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1465.03 860.47 -1.703 0.0918 .
## prestige 176.43 17.26 10.224 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2984 on 100 degrees of freedom
## Multiple R-squared: 0.5111, Adjusted R-squared: 0.5062
## F-statistic: 104.5 on 1 and 100 DF, p-value: < 2.2e-16
PART (C)
Create a residuals vs fitted-values plot and a Q-Q Plot of the residuals for this regression.
par(mfrow=c(1,3))
plot(presto, which=1:2)
PART (D)
Comment on any difficulties the diagnostic plots in Part (c) reveal about the regression.
Comment on which estimates of Part (b) are likely affected by these difficulties.
Problem 3:
Open the Burt data set from library(car).
This data set is famous for being fraudulent, or fake. See ?Burt for more details. One of the first indicators that it was fraudulent was revealed by regressing IQbio ~ IQfoster. This regression was just a little too good to be real. (Note that for social science data, like this data, \(R^2\) values above 0.3 are impressive. Values above 0.7 are rare.)
PART (A)
Plot the data and fitted regression line. State the estimated values of \(\beta_0\), \(\beta_1\), and \(\sigma\) as well as the \(R^2\) of the regression.
ernie <- lm(IQbio ~ IQfoster, data= Burt)
ggplot(Burt, aes(x= IQfoster, y = IQbio))+
geom_point(size = 1.5, shape = 19, color = "wheat", alpha = 1) +
geom_smooth(method = "lm",formula = y~x, se=FALSE, color = "burlywood")+
labs(title= "Linked IQ? \n (Burt data set)", x="Twin IQ, Raised by Foster Parents (IQfoster)", y= "Twin IQ, Raised by Biological Parents (IQbio)")+
theme_classic()
PART (B)
Create a (1) residuals vs. fitted-values plot, (2) Q-Q Plot of the residuals, and (3) residuals vs. order plot for this regression. Are any problems with regression assumption violations visible in these plots?
par(mfrow=c(1,3))
plot(ernie, which=1:2)
plot(ernie$residuals)
PART (C)
Comment on what the three diagnostic plots of Part (b) show for the regression.
Problem 4:
Open the mtcars data set in R.
Perform a regression of mpg explained by the displacement of the vehicle’s engine.
PART (A)
Plot the data and fitted regression line. State the estimated values of \(\beta_0\), \(\beta_1\), and \(\sigma\) as well as the \(R^2\) of the regression.
haveyouseenthecarsmovie <- lm(mpg ~ disp, data= mtcars)
ggplot(mtcars, aes(x= disp, y= mpg))+
geom_point(size = 1.5, shape = 19, color = "limegreen", alpha = 1) +
geom_smooth(method = "lm",formula = y~x, se = FALSE, color = "limegreen") +
labs(title = "Reduced Fuel Efficiency \n (mtcars data set)", x= "Engine Displacement cu. in. (disp)", y= "Gas Mileage (mpg)") +
theme_classic()
summary(haveyouseenthecarsmovie)
##
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8922 -2.2022 -0.9631 1.6272 7.2305
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.599855 1.229720 24.070 < 2e-16 ***
## disp -0.041215 0.004712 -8.747 9.38e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared: 0.7183, Adjusted R-squared: 0.709
## F-statistic: 76.51 on 1 and 30 DF, p-value: 9.38e-10
PART (B)
Create a (1) residuals vs. fitted-values plot, (2) Q-Q Plot of the residuals, and (3) residuals vs. order plot for this regression. Are any problems with regression assumption violations visible in these plots?
par(mfrow=c(1,3))
plot(haveyouseenthecarsmovie, which=1)
qqPlot(haveyouseenthecarsmovie$residuals, main="Q-Q Plot", col.lines="blue",pch= 19, id=FALSE)
plot(haveyouseenthecarsmovie$residuals, ylab= "Residuals", main="Residuals vs Order")
PART (C)
Comment on what the three diagnostic plots of Part (b) show for the validity of the values computed in Part (a).
Problem 5:
Open the Orange data set found in R.
Perform a regression that explains the circumference of the trunk of the orange tree as the tree ages.
PART (A)
Plot the data and fitted simple linear regression line.
ilikeapples <- lm(circumference ~ age, data=Orange)
ggplot(Orange, aes(x= age, y= circumference))+
geom_point(size = 1.5, shape = 19, color = "orange", alpha = 1) +
geom_smooth(method = "lm", formula = y~x, se= FALSE, color = "grey") +
labs(title = "Growth of Orange Trees", x = "Age of Tree in Days", y = "Circumference of Tree(mm)")+
theme_classic()
PART (B)
State the estimated values for \(\beta_0\), \(\beta_1\), and \(\sigma\) for this regression.
summary(ilikeapples)
##
## Call:
## lm(formula = circumference ~ age, data = Orange)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.310 -14.946 -0.076 19.697 45.111
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.399650 8.622660 2.018 0.0518 .
## age 0.106770 0.008277 12.900 1.93e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.74 on 33 degrees of freedom
## Multiple R-squared: 0.8345, Adjusted R-squared: 0.8295
## F-statistic: 166.4 on 1 and 33 DF, p-value: 1.931e-14
PART (C)
Create a residuals vs fitted-values plot and a Q-Q Plot of the residuals for this regression.
par(mfrow=c(1,3))
plot(ilikeapples, which=1)
qqPlot(ilikeapples$residuals, main="Q-Q Plot", col.lines="blue",pch= 19, id=FALSE)
plot(ilikeapples$residuals, ylab= "Residuals", main="Residuals vs Order")
PART (D)
Comment on any difficulties the diagnostic plots in Part (c) reveal about the regression.
Comment on which estimates of Part (b) are likely affected by these difficulties.
Difficulties:
PART (E)
Perform a Box-Cox analysis of the regression. Which Y-transformation is suggested?
boxCox(ilikeapples)
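boxCox() only draws the profile log-likelihood; a sketch of extracting the suggested \(\lambda\) numerically, using MASS::boxcox (MASS is loaded above) — the peak lands near 0.5, i.e. the square-root transformation used in Part (f):

```r
# Find the lambda that maximizes the Box-Cox profile log-likelihood
ilikeapples <- lm(circumference ~ age, data = Orange)  # Orange is built in
bc <- MASS::boxcox(ilikeapples, plotit = FALSE)
bc$x[which.max(bc$y)]  # close to 0.5, so Y' = sqrt(Y)
```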
PART (F)
Perform a regression with the transformed y-variable. Plot the regression in the transformed units. Diagnose the fit of the regression on the transformed data.
notmyapples <- lm(sqrt(circumference) ~ age, data=Orange)
b.sqrt <- notmyapples$coef
ggplot(Orange, aes(x= age, y= sqrt(circumference)))+
geom_point(size = 1.5, shape = 19, color = "orange", alpha = 1) +
geom_smooth(method = "lm", formula = y~x, se= FALSE, color = "grey") +
labs(title = "Growth of Orange Trees", x = "Age of Tree in Days", y = "Circumference of Tree(mm)")+
theme_classic()
par(mfrow=c(1,3))
plot(notmyapples, which=1)
qqPlot(notmyapples$residuals, main="Q-Q Plot", col.lines="blue",pch= 19, id=FALSE)
plot(notmyapples$residuals, ylab= "Residuals", main="Residuals vs Order")
Which of the following have actually become more problematic when comparing the diagnostic plots of the original regression to the diagnostic plots of the transformed regression?
PART (G)
Write out the fitted model for \(\hat{Y}_i'\). Then solve the transformed model back into the original units for \(\hat{Y}_i\). Then compute the following.
When \(X_i = 500\), then \(\hat{Y}_i = \ldots\).
\[ \hat{Y}_i' = 5.3408 + 0.0055 X_i \]
\[ \sqrt{\hat{Y}_i} = 5.3408 + 0.0055 X_i \]
\[ \hat{Y}_i = (5.3408 + 0.0055 X_i)^2 \]
(5.3408 + 0.0055 * 500)^2
## [1] 65.46104
PART (H)
Plot the data in the original units. Place the transformed line, back in the original units, on this plot.
plot(circumference ~ age, data=Orange, pch=16, col="orangered", main="Growth of Orange Trees", xlab="Age of Tree in Days", ylab="Circumference of Tree (mm)")
abline(ilikeapples, col= "gray")
curve( (b.sqrt[1] + b.sqrt[2]*x)^2 , add=TRUE, col="orangered")
Residual Plot:
Original Plot:
Select the appropriate equation for \(\hat{Y_i}\)
Answer: \(\hat{Y_i} = \frac{1}{\sqrt{830.13 - 10.93 X_i}}\)
kachow <- lm(dist ~ speed, data = cars)
boxCox(kachow)
Base R Version:
lm.scoop <- lm(sqrt(dist) ~ speed, data = cars)
b.scoop <- coef(lm.scoop)
plot(dist ~ speed, data = cars)
curve( (b.scoop[1] + b.scoop[2]*x)^2, add=TRUE, col = "yellow")
abline(kachow)
ggplot Version:
ggplot(cars, aes(x=speed, y=dist)) +
geom_point(color = "orangered") +
stat_function(fun=function(x)(b.scoop[1] + b.scoop[2]*x)^2, color= "yellow") +
theme_classic()
# model is a previously fitted lm() object; n is the number of x-y pairs used
b1 <- coef(model)[2]
se_b1 <- summary(model)$coefficients[2, 2]
t_stat <- b1 / se_b1
df <- n - 2  # degrees of freedom for simple linear regression
p_val <- 2 * pt(-abs(t_stat), df)
t_stat
## x
## -0.6161284
p_val
## x
## 0.5407208
Problem 1 Install the Ecdat library in
R: install.packages("Ecdat").
From library(Ecdat) open the Caschool data
set in R. As stated in the help file for this data set, this data is a
collection of measurements on 420 different school districts from
California during the 1998-1999 school year.
The school districts in California offer a reduced-price lunch program. This is, in a way, a measure of the poverty of the student body of the school district. We will assume that the higher the percentage of participants, the greater the general level of poverty. The question is, does the poverty level (or at least the percentage of participation in the reduced-lunch program) predict how well the student body will perform overall on a standardized test?
> ?Caschool
> View(Caschool)
Part (a)
Type out the mathematical equation for this regression model and label both \(Y\) and \(X\) in the equation.
\[ \underbrace{Y_i}_\text{test score} = \beta_0 + \beta_1 \underbrace{X_i}_\text{lunch percentage} + \epsilon_i \]
Part (b)
Plot a scatterplot of the data with your regression line overlaid. Write out the fitted regression equation.
library(Ecdat)
lunch <- lm(testscr ~ mealpct,data= Caschool)
plot(testscr ~ mealpct,data= Caschool)
abline(lunch)
\[ \hat{Y}_i = 681.43952 + (-0.61029)X_i \]
Part (c)
Report the test statistics and p-values for the following hypotheses.
\[ \begin{array}{l} H_0: \beta_0 = 0 \\ H_a: \beta_0 \neq 0 \\ \end{array} \quad \begin{array}{l} H_0: \beta_1 = 0 \\ H_a: \beta_1 \neq 0 \\ \end{array} \]
summary(lunch)
##
## Call:
## lm(formula = testscr ~ mealpct, data = Caschool)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.539 -6.038 -0.647 6.245 35.543
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 681.43952 0.88943 766.16 <2e-16 ***
## mealpct -0.61029 0.01701 -35.87 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.447 on 418 degrees of freedom
## Multiple R-squared: 0.7548, Adjusted R-squared: 0.7542
## F-statistic: 1286 on 1 and 418 DF, p-value: < 2.2e-16
The test statistics give the number of standard errors that each estimated parameter value sits from its hypothesized value of zero.
Part (d)
State the slope, y-intercept, and \(R^2\) of this regression. Further, provide 95% confidence intervals for the slope and intercept. Interpret the values.
summary(lunch)
##
## Call:
## lm(formula = testscr ~ mealpct, data = Caschool)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.539 -6.038 -0.647 6.245 35.543
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 681.43952 0.88943 766.16 <2e-16 ***
## mealpct -0.61029 0.01701 -35.87 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.447 on 418 degrees of freedom
## Multiple R-squared: 0.7548, Adjusted R-squared: 0.7542
## F-statistic: 1286 on 1 and 418 DF, p-value: < 2.2e-16
confint(lunch, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 679.6912170 683.1878250
## mealpct -0.6437314 -0.5768403
As shown by the R-squared value, the regression is fairly good at explaining the variability in Y. This is further witnessed by how narrow the confidence intervals are for the slope and intercept.
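The confint() values can be reproduced by hand as estimate \(\pm\, t^* \times\) standard error, with \(t^*\) from the t distribution with 418 degrees of freedom; a sketch using the numbers from the summary output:

```r
# Reproduce confint(lunch) by hand: estimate +/- t* x standard error, df = 418
tstar <- qt(0.975, 418)
681.43952 + c(-1, 1) * tstar * 0.88943  # intercept interval
-0.61029  + c(-1, 1) * tstar * 0.01701  # slope interval
```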
Part (e)
Create a residuals vs fitted-values plot and Q-Q Plot of the residuals for this regression. What do these plots show?
Answer: The residuals vs fitted-values plot shows a nice linear pattern and very constant variance, with the exception of three possible outliers in points 367, 163, and 180. The Q-Q Plot shows the residuals can very safely be assumed to be normally distributed. To prove that you made the plot, the dot in the top-right of the plot is labeled as point #367.
par(mfrow=c(1,3))
plot(lunch, which= 1:2)
plot(lunch$residuals)
Problem 2
Open the Clothing data set from library(Ecdat).
Although this data is from 1990, it contains two interesting
variables (1) the total tsales of the clothing stores and
(2) the average number of hours worked per employee during the year,
hourspw.
> ?Clothing
> View(Clothing)
Part (a)
Type out the mathematical equation for this regression model and label both \(Y\) and \(X\) in the equation.
\[ \underbrace{Y_i}_\text{Total Sales} = \beta_0 + \beta_1 \underbrace{X_i}_\text{Hours Per Employees} + \epsilon_i \]
Part (b)
Plot a scatterplot of the data with your regression line overlaid. Write out the fitted regression equation.
mahclothes <- lm(tsales ~ hourspw,data= Clothing)
plot(tsales ~ hourspw,data= Clothing)
abline(mahclothes)
\[ \hat{Y}_i = 1745 + 43885 X_i \]
Part (c)
Report the test statistics and p-values given by your summary(…) in R for the following hypotheses.
\[ \begin{array}{l} H_0: \beta_0 = 0 \\ H_a: \beta_0 \neq 0 \\ \end{array} \quad \begin{array}{l} H_0: \beta_1 = 0 \\ H_a: \beta_1 \neq 0 \\ \end{array} \]
summary(mahclothes)
##
## Call:
## lm(formula = tsales ~ hourspw, data = Clothing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -974653 -266420 -65970 131524 4721088
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1745 67479 0.026 0.979
## hourspw 43885 3320 13.218 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 487000 on 398 degrees of freedom
## Multiple R-squared: 0.3051, Adjusted R-squared: 0.3033
## F-statistic: 174.7 on 1 and 398 DF, p-value: < 2.2e-16
Type your answer here…
Part (d)
Now, use your own calculations to obtain test statistics and p-values for the following hypotheses.
You may find useful information on how to do this in the “Explanation” tab under “t Tests” from your Math 325 Notebook, Simple Linear Regression page.
\[ \begin{array}{l} H_0: \beta_0 = 1500 \\ H_a: \beta_0 \neq 1500 \\ \end{array} \quad \begin{array}{l} H_0: \beta_1 = 35000 \\ H_a: \beta_1 \neq 35000 \\ \end{array} \]
Note that these hypotheses come from previous knowledge about clothing sales and employee hours. They state that in years past, the average annual sales when no employees worked any hours on average was 1500. And that as average employee hours worked increases by 1 hour, the average total annual sales increases by 35,000. The question now is whether the earning pattern has changed from what it used to be.
\[t = \frac{\bar{x}-\mu}{s/\sqrt{n}}\]
OR
\[t = \frac{\text{Estimate}(b_0 \text{ or } b_1) - \mu}{\text{Std. Error}(b_0 \text{ or } b_1)}\]
# Intercept
(1745 - 1500)/67479
## [1] 0.003630759
# Slope
(43885 - 35000)/3320
## [1] 2.676205
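The matching two-sided p-values for these hand-computed test statistics come from the t distribution with \(n - 2 = 398\) degrees of freedom:

```r
# Two-sided p-values for the hand-computed t statistics (df = 398)
2 * pt(-abs(0.003630759), 398)  # intercept: very large, fail to reject H0
2 * pt(-abs(2.676205), 398)     # slope: small, reject H0: beta1 = 35000
```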
Part (e)
State the slope, y-intercept, and \(R^2\) of this regression. Further, provide 95% confidence intervals for the slope and intercept. Interpret the values.
summary(mahclothes)
##
## Call:
## lm(formula = tsales ~ hourspw, data = Clothing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -974653 -266420 -65970 131524 4721088
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1745 67479 0.026 0.979
## hourspw 43885 3320 13.218 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 487000 on 398 degrees of freedom
## Multiple R-squared: 0.3051, Adjusted R-squared: 0.3033
## F-statistic: 174.7 on 1 and 398 DF, p-value: < 2.2e-16
confint(mahclothes, level= 0.95)
## 2.5 % 97.5 %
## (Intercept) -130915.29 134404.41
## hourspw 37357.77 50411.97
Part (f)
Create a residuals vs fitted-values plot and Q-Q Plot of the residuals for this regression. What do these plots show?
They show some possible difficulties with linearity and constant variance, but more importantly observation number 397 is an extreme outlier that should be removed. This outlier is causing the estimate of the variance of the error terms, called the MSE, to be much larger than it should be, thus causing the \(R^2\) value to be lower than it should be.
In fact, removing the outlier and re-running the regression increases the \(R^2\) value to 0.393. (Round to 3 decimal places.)
par(mfrow=c(1,3))
plot(mahclothes, which= 1:2)
plot(mahclothes$residuals)
Part (g)
Do any x-transformations or y-transformations improve the regression? If so, which ones?
The Box-Cox analysis suggests a \(Y^{0.25}\) transformation on Y, which is further improved by a log transformation on X. So the final plot looks like:
boxCox(mahclothes)
plot(tsales ~ log(hourspw),data= Clothing)
Question 1: Below is the summary output from a simple linear regression performed in R. What is the value of the missing test statistic?
Answer: 3.481
Question 2: Below is the summary output from a simple linear regression performed in R. Compute the 95% confidence interval for the slope estimate.
Answer: (7.701, 9.423)
qt(1-0.05/2, 397) # gives the critical value t* = 1.965957
The std. error is 0.4379, as shown in the summary output.
So,
\[\underbrace{8.5621 - 1.965957 * 0.4379}_\text{lower bound}\]
\[\underbrace{8.5621 + 1.965957 * 0.4379}_\text{Upper bound}\]
If you used instead:
\[8.5621 \pm 2 \times 0.4379\]
then you would have still come close to the correct answer, but it would not be as correct as using the actual critical value from the t distribution with 397 degrees of freedom.
Question 3: Suppose a 95% confidence interval for the true regression slope is obtained as (-7.453, -0.947) from a sample of 100 x-y points. What is the p-value for the test of the following hypotheses?
\[H_0 : \beta_1 = -10 \\ H_a : \beta_1 \neq -10\]
Answer: 0.0006175379
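One way to get this answer: the midpoint of the interval recovers \(b_1\), and dividing the half-width by \(t^*\) (with \(n - 2 = 98\) degrees of freedom) recovers the standard error. A sketch:

```r
# Recover b1 and its standard error from the 95% CI, then test H0: beta1 = -10
b1 <- (-7.453 + -0.947) / 2                  # midpoint of the interval
se <- (-0.947 - -7.453) / 2 / qt(0.975, 98)  # half-width divided by t*
t  <- (b1 - (-10)) / se
2 * pt(-abs(t), 98)                          # about 0.00062
```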
Problem 1
Install the alr3 library in R:
install.packages("alr3").
From library(alr3) open the BGSall data set
in R. As stated in the help file for this data set, this data is a
collection of measurements on “children born in 1928-29 in Berkeley,
CA.”
> ?BGSall
> View(BGSall)
A standing tradition is that if you measure a child when they are 2-years old and double their height, this will give a good prediction of their final adult height. Let's see if this is true.
Perform a regression that could predict a child’s 18-year old height from their 2-year old height.
Part (a)
Type out the mathematical equation for this regression model and label both \(Y\) and \(X\) in the equation.
\[ \underbrace{Y_i}_\text{Height 18} = \beta_0 + \beta_1 \underbrace{X_i}_\text{Height 2} + \epsilon_i \quad \text{where} \ \epsilon_i \sim N(0, \sigma^2) \]
Part (b)
Plot a scatterplot of the data with your regression line overlaid. Write out the fitted regression equation.
tall <- lm(HT18 ~ HT2, data= BGSall)
plot(HT18 ~ HT2, data= BGSall)
abline(tall)
\[ \hat{Y}_i = 45.7966 + 1.4441 X_i \]
Part (c)
Report the test statistics and p-values for the following hypotheses. (The hypotheses about \(\beta_0\) claim that at age 0-years, a child should have height 0 cm. The hypotheses about \(\beta_1\) claim that height doubles from age 2 to 18.)
\[ \begin{array}{l} H_0: \beta_0 = 0 \\ H_a: \beta_0 \neq 0 \\ \end{array} \quad \begin{array}{l} H_0: \beta_1 = 2 \\ H_a: \beta_1 \neq 2 \\ \end{array} \]
summary(tall)
##
## Call:
## lm(formula = HT18 ~ HT2, data = BGSall)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5406 -5.9807 -0.3401 5.7771 17.3163
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.7966 16.7006 2.742 0.00694 **
## HT2 1.4441 0.1901 7.597 4.67e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.422 on 134 degrees of freedom
## Multiple R-squared: 0.301, Adjusted R-squared: 0.2958
## F-statistic: 57.71 on 1 and 134 DF, p-value: 4.67e-12
Type your answer here…
Part (d)
State the slope, y-intercept, and \(R^2\) of this regression. Further, provide 95% confidence intervals for the slope and intercept. Interpret the values.
summary(tall)
##
## Call:
## lm(formula = HT18 ~ HT2, data = BGSall)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5406 -5.9807 -0.3401 5.7771 17.3163
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.7966 16.7006 2.742 0.00694 **
## HT2 1.4441 0.1901 7.597 4.67e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.422 on 134 degrees of freedom
## Multiple R-squared: 0.301, Adjusted R-squared: 0.2958
## F-statistic: 57.71 on 1 and 134 DF, p-value: 4.67e-12
confint(tall, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 12.765715 78.827522
## HT2 1.068108 1.820011
Part (e)
Create a residuals vs fitted-values plot and Q-Q Plot of the residuals for this regression. What do these plots show?
par(mfrow=c(1,3))
plot(tall, which= 1:2)
Part (f)
A certain stats teacher at BYU-Idaho has a son named Andrew who turns 2-years old in a couple of weeks. He is currently 2 feet 9 inches tall. Predict his 18-year old height in centimeters with 95% confidence.
predict(tall, data.frame(HT2=83.82), interval="prediction")
## fit lwr upr
## 1 166.8377 152.0294 181.646
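As a check, the 2 feet 9 inches converts to the 83.82 cm used in predict() as follows:

```r
# Convert 2 ft 9 in to centimeters (12 inches per foot, 2.54 cm per inch)
(2 * 12 + 9) * 2.54  # 83.82 cm, the HT2 value used above
```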
Problem 2
Open the wblake data set from library(alr4).
> ?wblake
> View(wblake)
If you love fishing, then you might like this data. The goal of this data was to see if there was a link between the radius size of a key “Scale” of a fish and the overall “Length” of the fish.
Part (a)
Type out the mathematical equation for this regression model and label both \(Y\) and \(X\) in the equation.
\[ \underbrace{Y_i}_\text{Scale} = \beta_0 + \beta_1 \underbrace{X_i}_\text{Length} + \epsilon_i \quad \text{where} \ \epsilon_i \sim N(0, \sigma^2) \]
Part (b)
Plot a scatterplot of the data with your regression line overlaid. Write out the fitted regression equation. State the \(R^2\) value of the regression.
fishy.lm <- lm(Scale ~ Length, data = wblake)
summary(fishy.lm)
##
## Call:
## lm(formula = Scale ~ Length, data = wblake)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9240 -0.5011 -0.1639 0.1910 3.9795
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.4307592 0.1356603 -10.55 <2e-16 ***
## Length 0.0378026 0.0006644 56.90 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9288 on 437 degrees of freedom
## Multiple R-squared: 0.8811, Adjusted R-squared: 0.8808
## F-statistic: 3237 on 1 and 437 DF, p-value: < 2.2e-16
\[ \hat{Y}_i = -1.4308 + 0.0378 X_i \quad \text{with} \quad R^2 = 0.8811 \]
plot(Scale ~ Length, data = wblake)
abline(fishy.lm)
Part (c)
Diagnose the regression with a residuals vs. fitted-values plot. Determine which Y-transformation is suggested for this data.
par(mfrow=c(1,2))
plot(fishy.lm, which=1)
qqPlot(fishy.lm$residuals, main="Q-Q Plot", col="darkolivegreen", col.lines="darkgreen",pch= 19, id=FALSE)
boxCox(fishy.lm)
Part (d)
Perform a regression of the form \(Y' = Y^\lambda\). Use your answer to Part (c) to select \(\lambda\).
Plot the regression in the transformed space, \(Y' \sim X\) and add the fitted regression to the plot.
plot(sqrt(Scale) ~ Length, data = wblake)
abline(lm(sqrt(Scale) ~ Length, data = wblake))
Part (e)
State the slope, y-intercept, and \(R^2\) of this transformed regression.
fisher.lm <- lm(sqrt(Scale) ~ Length, data = wblake)
summary(fisher.lm)
##
## Call:
## lm(formula = sqrt(Scale) ~ Length, data = wblake)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55799 -0.11064 -0.03639 0.05285 0.62613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7881700 0.0263782 29.88 <2e-16 ***
## Length 0.0081115 0.0001292 62.79 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1806 on 437 degrees of freedom
## Multiple R-squared: 0.9002, Adjusted R-squared: 0.9
## F-statistic: 3942 on 1 and 437 DF, p-value: < 2.2e-16
Part (f)
Create a residuals vs fitted-values plot and Q-Q Plot of the residuals for this transformed regression. Does this regression look better than the original?
par(mfrow=c(1,2))
plot(fisher.lm, which=1)
qqPlot(fisher.lm$residuals, main="Q-Q Plot", col="darkolivegreen", col.lines="darkgreen",pch= 19, id=FALSE)
Part (g)
Untransform the fitted regression equation and draw it on a scatterplot of the original data. Include the original regression line on this plot.
# Type your code here...
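One possible approach (a sketch that refits fishy.lm and fisher.lm so the chunk stands alone): square the transformed line \(b_0 + b_1 X\) to return it to the original units.

```r
# Back-transform the sqrt-scale fit: Y-hat = (b0 + b1*X)^2
library(alr4)                                         # provides wblake
fishy.lm  <- lm(Scale ~ Length, data = wblake)        # original regression
fisher.lm <- lm(sqrt(Scale) ~ Length, data = wblake)  # transformed regression
b <- coef(fisher.lm)
plot(Scale ~ Length, data = wblake)
abline(fishy.lm, col = "gray")                        # original line
curve((b[1] + b[2]*x)^2, add = TRUE, col = "darkgreen")  # back-transformed curve
```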
Part (h)
Place two prediction intervals for the Scale radius when the Length of the fish is 250 on your scatterplot of the data in the original units.
Show the prediction interval from the original regression in the original units.
Show the prediction interval from the transformed regression back on the original units.
predict(fishy.lm,data.frame(Length = 250), interval="prediction")
## fit lwr upr
## 1 8.019889 6.19091 9.848869
predict(fisher.lm,data.frame(Length = 250), interval="prediction")^2
## fit lwr upr
## 1 7.93008 6.053607 10.0595
We can include categorical variables in a regression model using dummy (0/1) variables.
set.seed(42)
group <- rep(c("A", "B"), each = 50)
x <- c(rnorm(50, 5, 1), rnorm(50, 5, 1))
y <- c(2 + 3 * x[1:50] + rnorm(50), 4 + 1.5 * x[51:100] + rnorm(50))
data <- data.frame(x, group, y)
model_two_lines <- lm(y ~ x * group, data = data)
summary(model_two_lines)
##
## Call:
## lm(formula = y ~ x * group, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.72823 -0.55004 0.00611 0.45243 3.07342
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4315 0.5764 2.484 0.01473 *
## x 3.0840 0.1132 27.255 < 2e-16 ***
## groupB 2.9073 0.9301 3.126 0.00235 **
## x:groupB -1.6551 0.1807 -9.160 9.47e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9121 on 96 degrees of freedom
## Multiple R-squared: 0.9444, Adjusted R-squared: 0.9427
## F-statistic: 544 on 3 and 96 DF, p-value: < 2.2e-16
library(ggplot2)
ggplot(data, aes(x = x, y = y, color = group)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
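In this coding, group A's fitted line is the (Intercept) and x coefficients, while group B's line shifts the intercept by groupB and the slope by x:groupB. A sketch of recovering both lines from the coefficients (re-running the simulation above so the chunk stands alone):

```r
# Recreate the simulated data, then read off the two fitted lines
set.seed(42)
group <- rep(c("A", "B"), each = 50)
x <- c(rnorm(50, 5, 1), rnorm(50, 5, 1))
y <- c(2 + 3 * x[1:50] + rnorm(50), 4 + 1.5 * x[51:100] + rnorm(50))
model_two_lines <- lm(y ~ x * group, data = data.frame(x, group, y))
b <- coef(model_two_lines)
c(intercept = b[["(Intercept)"]], slope = b[["x"]])                # group A line
c(intercept = b[["(Intercept)"]] + b[["groupB"]],
  slope     = b[["x"]] + b[["x:groupB"]])                          # group B line
```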
Problem 1
Consider the scatterplot shown here of a single residence’s monthly
gas bill according to the month of the year. See ?Utilities
for more details on the data.
plot(gasbill ~ month, data=Utilities, main="Single Residence in Minnesota", xlab="Month of the Year", ylab="Monthly Gas Bill (US Dollars)")
Part (a)
Add the estimated regression function to the scatterplot above and state the function below.
lm.gassy <- lm(gasbill ~ month + I(month^2), data=Utilities)
summary(lm.gassy)
##
## Call:
## lm(formula = gasbill ~ month + I(month^2), data = Utilities)
##
## Residuals:
## Min 1Q Median 3Q Max
## -117.440 -15.127 -0.692 10.528 86.617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 284.4558 10.5374 27.00 <2e-16 ***
## month -76.6165 3.6609 -20.93 <2e-16 ***
## I(month^2) 5.4807 0.2712 20.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.48 on 114 degrees of freedom
## Multiple R-squared: 0.7937, Adjusted R-squared: 0.7901
## F-statistic: 219.3 on 2 and 114 DF, p-value: < 2.2e-16
b <- coef(lm.gassy)
plot(gasbill ~ month, data=Utilities, main="Single Residence in Minnesota", xlab="Month of the Year", ylab="Monthly Gas Bill (US Dollars)")
curve(b[1] + b[2]*x + b[3]*x^2, col="skyblue", lwd=2, add=TRUE)
\[ \hat{Y}_i = 284.4558 - 76.6165 X_i + 5.4807 X_i^2 \]
Part (b)
Diagnose the appropriateness of this regression model. How well does it fit the data?
Be sure to provide diagnostic plots and supporting arguments for your claims.
par(mfrow=c(1,3))
plot(lm.gassy, which=1)
qqPlot(lm.gassy, main= "Q-Q plot", pch = 16)
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
## [1] 2 4
plot(lm.gassy$residuals, main="Residuals vs Order")
Part (c)
What range of possible gas bill amounts do you predict for the September bill? How confident are you in your prediction?
predict(lm.gassy, data.frame(month = 9),interval = "prediction")
## fit lwr upr
## 1 38.84658 -22.00324 99.69641
Type your answer here…
Part (d)
Compute the mean of just September’s gas bills and draw a horizontal reference line for this mean on your scatterplot. How does this mean compare to your prediction in Part (c)?
sepbill <- Utilities %>%
filter(month == 9)
mean(sepbill$gasbill, na.rm=TRUE)
## [1] 22.815
Type your answer here…
Part (e)
State the values of \(R^2\), \(MSE\), and the residual standard error. What do each of these values tell us about the regression model?
summary(lm.gassy)
##
## Call:
## lm(formula = gasbill ~ month + I(month^2), data = Utilities)
##
## Residuals:
## Min 1Q Median 3Q Max
## -117.440 -15.127 -0.692 10.528 86.617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 284.4558 10.5374 27.00 <2e-16 ***
## month -76.6165 3.6609 -20.93 <2e-16 ***
## I(month^2) 5.4807 0.2712 20.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.48 on 114 degrees of freedom
## Multiple R-squared: 0.7937, Adjusted R-squared: 0.7901
## F-statistic: 219.3 on 2 and 114 DF, p-value: < 2.2e-16
Type your answer here…
Problem 2
View the mtcars dataset and corresponding help file
?mtcars.
Perform a regression that predicts the miles per gallon
mpg of the vehicle based on the quarter mile time
qsec and transmission type am of the
vehicle.
Part (a)
Plot the data and your fitted regression model.
vroom.lm <- lm(mpg ~ qsec, data=mtcars)
b <- coef(vroom.lm)
ggplot(mtcars, aes(y=mpg, x=qsec, color=factor(am))) +
geom_point(size = 1) +
geom_smooth(method = "lm", aes(group = as.factor(am)), se= FALSE) +
scale_color_manual(name = "am", values = c("skyblue", "orange"))+ theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
Part (b)
State the fitted regression model.
vroom.lm <- lm(mpg ~ qsec + am + qsec:am, data = mtcars)
summary(vroom.lm)
##
## Call:
## lm(formula = mpg ~ qsec + am + qsec:am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4551 -1.4331 0.1918 2.2493 7.2773
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.0099 8.2179 -1.096 0.28226
## qsec 1.4385 0.4500 3.197 0.00343 **
## am -14.5107 12.4812 -1.163 0.25481
## qsec:am 1.3214 0.7017 1.883 0.07012 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.343 on 28 degrees of freedom
## Multiple R-squared: 0.722, Adjusted R-squared: 0.6923
## F-statistic: 24.24 on 3 and 28 DF, p-value: 6.129e-08
\[ \hat{Y}_i = -9.0099 + 1.4385 \underbrace{X_{1i}}_\text{qsec} - 14.5107 \underbrace{X_{2i}}_\text{am} + 1.3214 X_{1i} X_{2i} \]
Part (c)
Use summary(...) to perform an appropriate t test to
determine if the interaction term is needed in this regression
model.
par(mfrow=c(1,3))
plot(vroom.lm, which=1)
qqPlot(vroom.lm, main= "Q-Q plot", pch = 16)
## Lotus Europa Volvo 142E
## 28 32
plot(vroom.lm$residuals, main="Residuals vs Order")
Type your answer here…
Part (d)
Diagnose the appropriateness of this regression model. How well does it fit the data?
Be sure to provide diagnostic plots and supporting arguments for your claims.
#Type your code here...
Type your answer here…
Part (e)
State the \(R^2\) and residual standard error values for this model. What do these values show?
#Type your code here...
Type your answer here…
Part (f)
Sometimes, when an interaction term is not significant, we may choose to use a reduced regression model that drops the interaction term. Mathematically this would turn the full two-lines model into this simpler mathematical model:
Note the absence of the interaction (change-in-slope) term in this reduced two-lines model.
Fit this reduced model on the mtcars data from above. State the fitted regression model and plot the model.
vroomvroom.lm <- lm(mpg ~ qsec + am, data = mtcars)
summary(vroomvroom.lm)
##
## Call:
## lm(formula = mpg ~ qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3447 -2.7699 0.2938 2.0947 6.9194
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18.8893 6.5970 -2.863 0.00771 **
## qsec 1.9819 0.3601 5.503 6.27e-06 ***
## am 8.8763 1.2897 6.883 1.46e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.487 on 29 degrees of freedom
## Multiple R-squared: 0.6868, Adjusted R-squared: 0.6652
## F-statistic: 31.8 on 2 and 29 DF, p-value: 4.882e-08
(Note that the interaction term is sometimes called the “change in slope term” and thus, once it is removed from the model, there is no more possibility for the two lines to have different slopes. They are forced to be parallel.)
b <- coef(vroomvroom.lm)
ggplot(mtcars, aes(y=mpg, x=qsec, color=factor(am))) +
geom_point(size = 1) +
geom_smooth(method = "lm", aes(group = as.factor(am)), se= FALSE) +
scale_color_manual(name = "am", values = c("coral", "aquamarine3"))+ theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
Problem 3
View the mtcars dataset and corresponding help file
?mtcars.
Perform a regression that predicts the quarter mile time
qsec of the vehicle based on the displacement of the engine
disp and transmission type am of the vehicle
according to the model:
\[ \underbrace{Y_i}_\text{qsec} = \beta_0 + \beta_1 \underbrace{X_{1i}}_\text{disp} + \beta_2 X_{1i}^2 + \beta_3 \underbrace{X_{2i}}_\text{am == 1} + \beta_4 X_{1i}X_{2i} + \beta_5 X_{1i}^2X_{2i} + \epsilon_i \]
where \(\epsilon_i \sim N(0, \sigma^2)\).
Part (a)
Plot the data and your fitted regression model.
mtcars$disp2 <- mtcars$disp^2
drivin.lm <- lm(qsec ~ disp + disp2 + am + disp:am + disp2:am, data = mtcars)
summary(drivin.lm)
##
## Call:
## lm(formula = qsec ~ disp + disp2 + am + disp:am + disp2:am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.59239 -0.58978 0.02637 0.43534 2.35282
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.619e+01 1.728e+00 15.155 2.02e-14 ***
## disp -4.937e-02 1.270e-02 -3.888 0.000626 ***
## disp2 6.607e-05 2.133e-05 3.097 0.004646 **
## am -3.663e+00 2.343e+00 -1.563 0.130076
## disp:am -2.926e-03 2.335e-02 -0.125 0.901222
## disp2:am 1.866e-05 5.143e-05 0.363 0.719707
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.033 on 26 degrees of freedom
## Multiple R-squared: 0.7197, Adjusted R-squared: 0.6658
## F-statistic: 13.35 on 5 and 26 DF, p-value: 1.695e-06
b <- coef(drivin.lm)
ggplot(mtcars, aes(x = disp, y = qsec, color = factor(am))) +
geom_point(size = 2) +
stat_function( fun = function(x) b[1] + b[2]*x + b[3]*x^2, color= "skyblue") +
stat_function( fun = function(x) (b[1]+b[4]) + (b[2]+b[5])*x + (b[3]+b[6])*x^2, color= "orange")+
scale_color_manual(name = "Transmission Type", values = c("skyblue", "orange"),
labels = c("Automatic", "Manual")) +
theme_classic() +
labs(title = "Regression of qsec on disp and Transmission Type",
x = "Displacement (disp)",
y = "Quarter Mile Time (qsec)")
Type your answer here…
Part (b)
State the fitted regression model.
\[ \hat{Y}_i = 26.19 - 0.04937 X_{1i} + 0.00006607 X_{1i}^2 - 3.663 X_{2i} - 0.002926 X_{1i}X_{2i} + 0.00001866 X_{1i}^2X_{2i} \]
Part (c)
Use summary(...) to perform appropriate t tests to
determine which interaction terms are needed in this regression
model.
#Type your code here...
Part (d)
Diagnose the appropriateness of this regression model. How well does it fit the data?
Be sure to provide diagnostic plots and supporting arguments for your claims.
drivin.lm <- lm(qsec ~ disp + disp2 + am + disp:am + disp2:am, data = mtcars)
par(mfrow=c(1,3))
plot(drivin.lm, which=1)
qqPlot(drivin.lm, main= "Q-Q plot", pch = 16)
## Valiant Merc 230
## 6 9
plot(drivin.lm$residuals, main="Residuals vs Order")
Type your answer here…
Part (e)
Drop the “least significant” term from the model. This is a silly statement really, but is an accepted practice when searching for a “best” model in regression analysis. What has changed?
WITH “least significant term”
mtcars$disp2 <- mtcars$disp^2
drivin.lm <- lm(qsec ~ disp + disp2 + am + disp:am + disp2:am, data = mtcars)
summary(drivin.lm)
##
## Call:
## lm(formula = qsec ~ disp + disp2 + am + disp:am + disp2:am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.59239 -0.58978 0.02637 0.43534 2.35282
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.619e+01 1.728e+00 15.155 2.02e-14 ***
## disp -4.937e-02 1.270e-02 -3.888 0.000626 ***
## disp2 6.607e-05 2.133e-05 3.097 0.004646 **
## am -3.663e+00 2.343e+00 -1.563 0.130076
## disp:am -2.926e-03 2.335e-02 -0.125 0.901222
## disp2:am 1.866e-05 5.143e-05 0.363 0.719707
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.033 on 26 degrees of freedom
## Multiple R-squared: 0.7197, Adjusted R-squared: 0.6658
## F-statistic: 13.35 on 5 and 26 DF, p-value: 1.695e-06
WITHOUT least significant term
mtcars$disp2 <- mtcars$disp^2
swervin.lm <- lm(qsec ~ disp + disp2 + am + disp2:am, data = mtcars)
summary(swervin.lm)
##
## Call:
## lm(formula = qsec ~ disp + disp2 + am + disp2:am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.57899 -0.56846 0.03619 0.45453 2.33218
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.630e+01 1.442e+00 18.246 < 2e-16 ***
## disp -5.024e-02 1.046e-02 -4.803 5.18e-05 ***
## disp2 6.750e-05 1.768e-05 3.817 0.000717 ***
## am -3.939e+00 7.840e-01 -5.024 2.86e-05 ***
## disp2:am 1.238e-05 1.143e-05 1.083 0.288546
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.014 on 27 degrees of freedom
## Multiple R-squared: 0.7196, Adjusted R-squared: 0.678
## F-statistic: 17.32 on 4 and 27 DF, p-value: 3.766e-07
Problem 4
set.seed(101)
n <- 100
X_i <- runif(n, -2, 2)
X_2i <- sample(c(0,1), n, replace=TRUE)
beta0 <- 2
beta1 <- 5
beta2 <- -3
beta3<- -4
beta4<- -2
beta5<- 7
sigma <- 2
epsilon_i <- rnorm(n, 0, sigma)
Y_i <- beta0 + beta1*X_i + beta2*(X_i)^2 + beta3*X_2i + beta4*X_i*X_2i + beta5*(X_i)^2*X_2i + epsilon_i
fabData <- data.frame(x = X_i, y = Y_i, x2 = X_2i)
fab.lm <- lm(y ~ x + I(x^2) + x2 + x:x2 + I(x^2):x2, data = fabData)
b <- coef(fab.lm)
# betas are the solid, the coefficients from the lm are the dashed
ggplot(fabData, aes(x = x, y = y, color = as.factor(x2))) +
geom_point(size = 1, pch=19) +
scale_color_manual(name = "Groups", values = c("firebrick", "midnightblue")) +
stat_function(fun = function(x) beta0 + beta1*x + beta2*x^2, color = "firebrick", linetype = "solid", size = 1) +
stat_function(fun = function(x) (beta3 + beta0) + (beta4 + beta1)*x + (beta5 + beta2)*x^2, color = "midnightblue", linetype = "solid", size = 1) +
stat_function(fun = function(x) b[1] + b[2]*x + b[3]*x^2, color = "firebrick", linetype = "dashed", size = 1) +
stat_function(fun = function(x) (b[1] + b[4]) + (b[2] + b[5])*x + (b[3] + b[6])*x^2, color = "midnightblue", linetype = "dashed", size = 1) +
theme_minimal() +
labs(title = "Quadratic Regression for Two Models",
x = "X Values",
y = "Y Values")
Question 1: What is the equation of the regression function shown in this plot?
## Code to make this graphic...
ggplot(Wages1, aes(x = school, y = wage)) +
  geom_point(color = rgb(.3, .3, .2, .4)) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2, raw = TRUE), se = FALSE, col = rgb(.3, .3, .2, .8)) +
  theme_bw() +
  labs(title = "Best wages for highest schooling... or no schooling?",
       subtitle = "Wages1 data, library(Ecdat)",
       x = "Number of years of Schooling (school)",
       y = "Current Hourly Wage in 1980's USD (wage)")
## Code needed to obtain the solution...
summary(lm(wage ~ school + I(school^2), data=Wages1))
##
## Call:
## lm(formula = wage ~ school + I(school^2), data = Wages1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.337 -1.998 -0.474 1.444 34.554
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.13334 1.46274 4.193 2.82e-05 ***
## school -0.68354 0.25743 -2.655 0.00796 **
## I(school^2) 0.05488 0.01129 4.859 1.23e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.126 on 3291 degrees of freedom
## Multiple R-squared: 0.08636, Adjusted R-squared: 0.0858
## F-statistic: 155.5 on 2 and 3291 DF, p-value: < 2.2e-16
## Code to draw the plot in base graphics...
plot(wage ~ school, data = Wages1, pch = 16, col = rgb(.3, .3, .2, .4),
     main = "Best wages for highest schooling... or no schooling?",
     xlab = "Number of years of Schooling (school)",
     ylab = "Current Hourly Wage in 1980's USD (wage)")
mtext(side=3, text="Wages1 data, library(Ecdat)")
abline(v=seq(2,16,2), h=seq(0,40,10), lty=2, col=rgb(.2,.2,.2,.2))
wages.lm <- lm(wage ~ school + I(school^2), data=Wages1)
b <- coef(wages.lm)
curve(b[1] + b[2]*x + b[3]*x^2, add=TRUE, col=rgb(.3,.3,.2,.8), lwd=2)
Question 2: Which of the following graphics correctly depicts the regression model stated here?
\[Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \epsilon_i \text{ where } X_{2i} = \left\{\begin{array}{ll} 1, & \text{Blue Group} \\ 0, & \text{Orange Group} \end{array}\right. \]
Since the equation only allows the y-intercept to be changed by the \(\beta_2\) term, and there is no interaction term allowing the slope to change, then the correct answer is the plot that has lines with equal slopes. In other words, this model forces parallel lines. Only models that include an interaction term can have differing slopes.
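The parallel-lines behavior is easy to see with a small simulation. The data and names below are illustrative, not from the problem: a model with an indicator but no interaction shifts the intercept only, so both fitted lines share one slope.

```r
# Simulated illustration (hypothetical data): an indicator term with no
# interaction produces two parallel lines.
set.seed(42)
x  <- runif(60, 0, 10)
x2 <- rep(c(0, 1), each = 30)        # 0 = Orange group, 1 = Blue group
y  <- 3 + 2*x + 5*x2 + rnorm(60)
par.lm <- lm(y ~ x + x2)
b <- coef(par.lm)
plot(y ~ x, col = c("orange", "blue")[x2 + 1], pch = 16)
abline(b[1], b[2], col = "orange")       # baseline group line
abline(b[1] + b[3], b[2], col = "blue")  # same slope, shifted intercept
```

Note that both `abline(...)` calls use the same slope `b[2]`; only the intercept changes by `b[3]`.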
Question 3: Open the Loblolly data set in R.
Which regression model shows the best linearity according to the residuals vs fitted values plot of that regression?
# Running the codes:
plot(lm(height ~ age + I(age^2), data=Loblolly), which=1)
plot(lm(height ~ age + I(age^2) + I(age^3), data=Loblolly), which=1)
plot(lm(height ~ age + I(age^2) + I(age^3) + I(age^4), data=Loblolly), which=1)
plot(lm(height ~ age + I(age^2) + I(age^3) + I(age^4) + I(age^5), data=Loblolly), which=1)
# will show that the last one has the "flattest" red line, indicating the best "linear" fit of the data. To resolve the lack of constant variance, you could then apply:
boxCox(lm(height ~ age + I(age^2) + I(age^3) + I(age^4) + I(age^5), data=Loblolly))
# and the resulting transformation looks quite nice:
plot(lm(sqrt(sqrt(height)) ~ age + I(age^2) + I(age^3) + I(age^4) + I(age^5), data=Loblolly), which=1)
# except linearity is still lacking...
plot(lm(sqrt(sqrt(height)) ~ age + I(age^2) + I(age^3) + I(age^4) + I(age^5), data=Loblolly), which=2)
model_simple <- lm(y ~ x, data = data)
summary(model_simple)$r.squared
## [1] 0.3821555
summary(model_simple)$adj.r.squared
## [1] 0.3758509
ggplot(data, aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "loess", span = 0.5, se = FALSE, color = "blue")
## `geom_smooth()` using formula = 'y ~ x'
Problem 1
MODEL 1
Fit this model to the KidsFeet dataset. Use your model to answer the questions and re-create the graph shown below.
\[\overbrace{Y_i}^{length} = \beta_0 + \beta_1 \overbrace{X_{i1}}^{width} + \beta_2 X_{i1}^2 + \epsilon_i\]
feet.lm <- lm(length ~ width + I(width^2), data = KidsFeet)
summary(feet.lm)
##
## Call:
## lm(formula = length ~ width + I(width^2), data = KidsFeet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6131 -1.0162 0.2720 0.6543 2.0057
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.9167 43.9401 0.522 0.605
## width -1.2848 9.8528 -0.130 0.897
## I(width^2) 0.1647 0.5512 0.299 0.767
##
## Residual standard error: 1.038 on 36 degrees of freedom
## Multiple R-squared: 0.4125, Adjusted R-squared: 0.3798
## F-statistic: 12.64 on 2 and 36 DF, p-value: 6.961e-05
b <- coef(feet.lm)
plot(length ~ width, data = KidsFeet, pch=19)
curve(b[1] + b[2]*x + b[3]*x^2, col = "black", lwd = 2, add=TRUE)
MODEL 2
Fit this model to the KidsFeet dataset. Use your model to answer the questions and re-create the graph shown below.
\[\overbrace{Y_i}^{length} = \beta_0 + \beta_1 \overbrace{X_{i1}}^{width} + \beta_2 \overbrace{X_{i2}}^{sex} + \beta_3 X_{i1}X_{i2} + \epsilon_i\]
feet2.lm <- lm(length ~ width + sex + sex:width, data = KidsFeet)
summary(feet2.lm)
##
## Call:
## lm(formula = length ~ width + sex + sex:width, data = KidsFeet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6035 -1.0350 0.2049 0.7072 2.0209
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.9314 4.9118 2.226 0.03258 *
## width 1.5423 0.5339 2.889 0.00659 **
## sexG -1.1856 6.6054 -0.179 0.85858
## width:sexG 0.1170 0.7328 0.160 0.87410
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.051 on 35 degrees of freedom
## Multiple R-squared: 0.4136, Adjusted R-squared: 0.3634
## F-statistic: 8.229 on 3 and 35 DF, p-value: 0.0002821
b <- coef(feet2.lm)
plot(length ~ width, data = KidsFeet, col= c("skyblue", "firebrick")[as.factor(sex)], pch = 19)
curve(b[1] + b[2]*x, col = "skyblue", lwd = 2, add = TRUE)
curve((b[1] + b[3]) + (b[2] + b[4])*x, col = "firebrick", lwd = 2, add = TRUE)
MODEL 3
Fit this model to the KidsFeet dataset. Use your model to answer the questions and re-create the graph shown below.
\[\overbrace{Y_i}^{length} = \beta_0 + \beta_1 \overbrace{X_{i1}}^{width} + \beta_2 X_{i1}^2 + \beta_3 \overbrace{X_{i2}}^{sex} + \beta_4 X_{i1}^2X_{i2} + \epsilon_i\]
feet3 <- lm(length ~ width + I(width^2) + sex + I(width^2):sex, data = KidsFeet)
summary(feet3)
##
## Call:
## lm(formula = length ~ width + I(width^2) + sex + I(width^2):sex,
## data = KidsFeet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6285 -0.9834 0.1947 0.6172 2.0940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.36370 66.71647 0.725 0.473
## width -6.63499 14.54635 -0.456 0.651
## I(width^2) 0.44556 0.79217 0.562 0.577
## sexG -2.59549 4.84822 -0.535 0.596
## I(width^2):sexG 0.03040 0.05963 0.510 0.614
##
## Residual standard error: 1.062 on 34 degrees of freedom
## Multiple R-squared: 0.419, Adjusted R-squared: 0.3507
## F-statistic: 6.131 on 4 and 34 DF, p-value: 0.0007948
plot(length ~ width, data = KidsFeet, col = c("skyblue", "firebrick")[as.factor(sex)], pch = 19)
b <- coef(feet3)
curve(b[1] + b[2]*x + b[3]*x^2, lwd = 2, add= TRUE, col = "skyblue")
curve((b[1]+b[4]) + (b[2])*x + (b[3] + b[5])*x^2, lwd = 2, add = TRUE, col = "firebrick")
MODEL 4
Fit this model to the KidsFeet dataset. Use your model to answer the questions and re-create the graph shown below.
\[\overbrace{Y_i}^{length} = \beta_0 + \beta_1 \overbrace{X_{i1}}^{width} + \epsilon_i\]
feet4 <- lm(length ~ width, data = KidsFeet)
summary(feet4)
##
## Call:
## lm(formula = length ~ width, data = KidsFeet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6331 -1.0372 0.2299 0.7128 1.9642
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.8172 2.9381 3.341 0.00192 **
## width 1.6576 0.3262 5.081 1.1e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.025 on 37 degrees of freedom
## Multiple R-squared: 0.411, Adjusted R-squared: 0.3951
## F-statistic: 25.82 on 1 and 37 DF, p-value: 1.097e-05
plot(length ~ width, data = KidsFeet, pch = 19)
b <- coef(feet4)
curve(b[1] + b[2]*x, lwd = 2, add= TRUE)
Conclusion:
- Which of the above models is the “best” model of the four based on the adjusted R-squared value?
- Model 1: Quadratic Model (Adjusted \(R^2\) = 0.3798)
- Model 2: Two-lines Model (Adjusted \(R^2\) = 0.3634)
- Model 3: Double Quadratic Model (Adjusted \(R^2\) = 0.3507)
- Model 4: Simple Linear Model (Adjusted \(R^2\) = 0.3951)
- The BIGGEST adjusted \(R^2\) = “best” model, so Model 4 wins.
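The four adjusted \(R^2\) values can also be collected programmatically rather than read off four separate summaries. A sketch, assuming `library(mosaicData)` is loaded so `KidsFeet` is available (the list names below are our own labels):

```r
# Fit all four candidate models and compare their adjusted R^2 values.
library(mosaicData)
models <- list(
  quad     = lm(length ~ width + I(width^2), data = KidsFeet),
  twolines = lm(length ~ width + sex + sex:width, data = KidsFeet),
  dblquad  = lm(length ~ width + I(width^2) + sex + I(width^2):sex, data = KidsFeet),
  simple   = lm(length ~ width, data = KidsFeet)
)
sapply(models, function(m) summary(m)$adj.r.squared)
```

The largest value identifies the “best” model under this criterion.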
Problem 2
Part (a)
Create a scatterplot of the height of a tree on the
age of the tree. Overlay both a simple linear regression
line and a lowess curve, degree 1 on the plot. Does the lowess curve
suggest that the line is a good fit or a poor fit?
tree.lm <- lm(height ~ age, data = Loblolly)
plot(height ~ age, data = Loblolly)
b <- coef(tree.lm)
curve(b[1] + b[2]*x, lwd = 2, add = TRUE)
# Have the x value first and the y value second!!
lines(lowess(Loblolly$age, Loblolly$height), col="firebrick")
Part (b)
Re-draw the scatterplot, this time overlaying the curve from a quadratic regression model as well as the lowess curve of degree 1. Does this model look better or worse than the simple linear regression line when compared to the lowess curve?
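A sketch of code for this part, assuming the same `Loblolly` data and following the base-graphics pattern of Part (a); the model name `tree2.lm` is our own:

```r
# Quadratic regression of height on age, overlaid with the lowess curve.
tree2.lm <- lm(height ~ age + I(age^2), data = Loblolly)
b <- coef(tree2.lm)
plot(height ~ age, data = Loblolly)
curve(b[1] + b[2]*x + b[3]*x^2, lwd = 2, add = TRUE)
lines(lowess(Loblolly$age, Loblolly$height), col = "firebrick")
```

Comparing how closely the black quadratic curve tracks the firebrick lowess curve answers the better-or-worse question.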
Problem 3
Consider the ?Utilities data set. The goal of this question is to find a “best model” for predicting the monthly electricity bill, elecbill. Work through each of the parts below to do this.
Part (a)
Select (by clicking on it) the row of the pairs plot below that
corresponds to elecbill being the y-variable in each of the
plots. This is the main row you want to look at as you decide which
variable is most useful in explaining y, the elecbill.
Which variable is “visibly” the “best” according to the pairs plot for
predicting elecbill? Click on the correct plot and name the column. -
kwh
Part (b)
Run a regression for the following model where X1i is the variable you identified in Part (a).
\[\underbrace{Y_i}_\text{elecbill} = \beta_0 + \beta_1\underbrace{X_{1i}}_\text{kwh} + \epsilon_i\]
State the p-value for the β1 term. State the R2 value of the regression. Create the residuals vs fitted values plot for the regression. What do each of these show?
shocked <- lm(elecbill ~ kwh, data = Utilities)
summary(shocked) %>% pander()
|  | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -5.001 | 4.522 | -1.106 | 0.2711 |
| kwh | 0.1088 | 0.005816 | 18.7 | 1.155e-36 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 117 | 12.66 | 0.7525 | 0.7503 |
plot(shocked, which=1)
This plot shows that there is a possible problem in the linearity of the regression. A transformation may be in order. But first, let’s search for other useful variables.
Part (c)
Create a new pairs plot that includes the residuals and fitted values from the previous regression.
Hint: `cbind(R = lm1$res, Fit = lm1$fit, Utilities)`
uill <- cbind(R = shocked$residuals, Fit = shocked$fitted.values, Utilities)
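One way (an assumption on our part, since the original call is not shown) to draw the requested pairs plot is to hand the augmented data frame to `pairs(...)`, so the residuals `R` occupy the first row of panels:

```r
# Rebuild the Part (b) regression and the augmented data frame, then
# draw the pairs plot (assumes library(mosaicData) for Utilities).
library(mosaicData)
shocked <- lm(elecbill ~ kwh, data = Utilities)
uill <- cbind(R = shocked$residuals, Fit = shocked$fitted.values, Utilities)
pairs(uill, pch = 16, cex = 0.5, col = rgb(0.2, 0.2, 0.2, 0.5))
```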
There are a lot of interesting patterns in the pairs plot at this
point. Notice especially the plot for “month” in the first row of the
pairs plot. Also, notice how month is strongly related to
several variables in a quadratic way. Click on the scatterplots where
month, acting as the x-variable, is a strong predictor of those
variables.
Thus, while many variables could give some insight on the residuals,
month seems to explain most of those variables as well, so that would
suggest using month as a variable that could explain elecbill. Also,
month would be the best choice to actually use over any of the other
related variables, because it would be the easiest variable to “measure”
in real life. However, there is one other variable that does not relate
to month that also seems to shed some insight on the residuals from the
regression from Part (b). This variable is year, as it shows a low but
positive linear correlation with the residuals and is not related to
month at all according to the pairs plot.
Part (d)
Perform a second regression that adds both of these newly identified variables to the regression performed in Part (b). This should now be a regression of elecbill on three different explanatory variables, namely,
\[ \underbrace{Y_i}_\text{elecbill} = \beta_0 + \beta_1\underbrace{X_{1i}}_\text{kwh} + \beta_2 \underbrace{X_{2i}}_\text{month} + \beta_3 \underbrace{X_{3i}}_\text{year} + \epsilon_i \]
Report the p-values of each of the two new variables, as well as the adjusted \(R^2\) value of the regression, and create a residuals vs fitted values plot for the regression.
shocking <- lm(elecbill ~ kwh + month + year, data = Utilities)
summary(shocking)
##
## Call:
## lm(formula = elecbill ~ kwh + month + year, data = Utilities)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.6383 -6.1213 -0.4424 6.5672 22.3545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.496e+03 6.209e+02 -8.852 1.37e-14 ***
## kwh 1.004e-01 4.973e-03 20.179 < 2e-16 ***
## month 3.716e-01 2.933e-01 1.267 0.208
## year 2.741e+00 3.099e-01 8.845 1.43e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.811 on 113 degrees of freedom
## Multiple R-squared: 0.854, Adjusted R-squared: 0.8501
## F-statistic: 220.3 on 3 and 113 DF, p-value: < 2.2e-16
It is not surprising that the month term is not significant because it has a quadratic, not linear, relationship with the residuals. We will keep the month term in the model until we add the quadratic month term as well, then decide if we should remove the month term or not.
The adjusted \(R^2\) of the regression is 0.8501. This shows an even tighter fit than the original regression in Part (b), so our model is getting better.
plot(shocking, which=1)
This plot continues to suggest a possible y transformation to correct the constant variance problem. However, before we do that, let’s try adding in the quadratic month term.
Part (e)
Add to the regression model of Part (d) a quadratic month term. The model is now given by
\[ \underbrace{Y_i}_\text{elecbill} = \beta_0 + \beta_1\underbrace{X_{1i}}_\text{kwh} + \beta_2 \underbrace{X_{2i}}_\text{month} + \beta_3 \underbrace{X_{3i}}_\text{year} + \beta_4 \underbrace{X_{2i}^2}_\text{month^2} + \epsilon_i \]
Note how the p-values of all terms have now changed. State the adjusted R2 value of the regression. Create a residuals vs fitted values plot of the regression.
What does all of this show?
shockz <- lm(elecbill ~ kwh + month + year + I(month^2), data = Utilities)
summary(shockz)
##
## Call:
## lm(formula = elecbill ~ kwh + month + year + I(month^2), data = Utilities)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.347 -3.995 -0.183 4.443 23.561
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.327e+03 4.874e+02 -10.931 < 2e-16 ***
## kwh 1.106e-01 4.085e-03 27.083 < 2e-16 ***
## month 8.118e+00 9.432e-01 8.607 5.31e-14 ***
## year 2.644e+00 2.433e-01 10.867 < 2e-16 ***
## I(month^2) -6.072e-01 7.171e-02 -8.468 1.10e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.694 on 112 degrees of freedom
## Multiple R-squared: 0.911, Adjusted R-squared: 0.9078
## F-statistic: 286.5 on 4 and 112 DF, p-value: < 2.2e-16
The residual plot ironically now shows a problem with linearity and constant variance.
plot(shockz, which=1)
Part(f)
Let’s check one more time if any other variables should be added to the model before we try a transformation.
Create another pairs plot that uses the residuals and fitted values from the regression in Part (d). What do you see?
Hint: Look carefully at the residuals vs month, residuals vs year, and residuals vs temp plots. While residuals vs ccf and residuals vs ThermsPerDay also look interesting, these variables would be harder to measure than month, year, and temp. So let’s consider variables that are easiest to measure first.
Part (g)
Perform a regression that includes temp and I(temp^2) into the regression from Part (e). So the model now becomes
\[ \underbrace{Y_i}_\text{elecbill} = \beta_0 + \beta_1\underbrace{X_{1i}}_\text{kwh} + \beta_2 \underbrace{X_{2i}}_\text{month} + \beta_3 \underbrace{X_{3i}}_\text{year} + \beta_4 \underbrace{X_{2i}^2}_\text{month^2} + \beta_5 \underbrace{X_{4i}}_\text{temp} + \beta_6 \underbrace{X_{4i}^2}_\text{temp^2} + \epsilon_i \]
plzimded <- lm(elecbill ~ kwh + month + year + I(month^2) + temp + I(temp^2), data = Utilities)
summary(plzimded)
##
## Call:
## lm(formula = elecbill ~ kwh + month + year + I(month^2) + temp +
## I(temp^2), data = Utilities)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.924 -2.982 0.109 3.552 23.950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.399e+03 4.448e+02 -12.138 < 2e-16 ***
## kwh 1.051e-01 3.896e-03 26.985 < 2e-16 ***
## month 3.524e+00 2.676e+00 1.317 0.190623
## year 2.690e+00 2.222e-01 12.108 < 2e-16 ***
## I(month^2) -2.632e-01 1.929e-01 -1.364 0.175204
## temp -5.457e-01 2.369e-01 -2.304 0.023128 *
## I(temp^2) 7.964e-03 2.055e-03 3.876 0.000181 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.019 on 110 degrees of freedom
## Multiple R-squared: 0.9272, Adjusted R-squared: 0.9233
## F-statistic: 233.6 on 6 and 110 DF, p-value: < 2.2e-16
Part(h)
Drop month and I(month^2) from the model in Part (g) but leave all other terms in the model. This reduces the model to
\[ \underbrace{Y_i}_\text{elecbill} = \beta_0 + \beta_1\underbrace{X_{1i}}_\text{kwh} + \beta_3 \underbrace{X_{3i}}_\text{year} + \beta_5 \underbrace{X_{4i}}_\text{temp} + \beta_6 \underbrace{X_{4i}^2}_\text{temp^2} + \epsilon_i \]
bbqchik <- lm(elecbill ~ kwh + year+ temp + I(temp^2), data = Utilities)
summary(bbqchik)
##
## Call:
## lm(formula = elecbill ~ kwh + year + temp + I(temp^2), data = Utilities)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.706 -2.700 0.208 3.705 23.820
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.462e+03 4.354e+02 -12.544 < 2e-16 ***
## kwh 1.029e-01 3.328e-03 30.917 < 2e-16 ***
## year 2.723e+00 2.173e-01 12.530 < 2e-16 ***
## temp -3.780e-01 1.814e-01 -2.084 0.039481 *
## I(temp^2) 7.466e-03 1.930e-03 3.868 0.000184 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.023 on 112 degrees of freedom
## Multiple R-squared: 0.9258, Adjusted R-squared: 0.9232
## F-statistic: 349.5 on 4 and 112 DF, p-value: < 2.2e-16
plot(bbqchik, which = 1)
It looks like temp gives more insight than does month. So even though we chose month earlier in our model search process, it turns out that temp is the better variable in the end. So we dropped month and switched to temp.
Part(i)
Though we could check another pairs plot at this point (it actually doesn’t show anything useful), let’s try removing the outlier identified in Part (h) and a y-transformation instead. Run boxCox(…) on your lm from Part (h) after redoing the regression with the outlier removed. Which value of λ is suggested?
boxCox(bbqchik)
Any value of λ between roughly
0.5 and 1 is a possible value. Thus, leaving our
regression as is would be acceptable. And, since we are tired at this
point, that sounds like a great idea.
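The prompt asks for `boxCox(...)` on the Part (h) model refit without the outlier; the code above runs it on the full-data fit. A sketch of the refit version, assuming `library(car)` and `library(mosaicData)` as in the setup (the model name `bbqchik2` is our own):

```r
# Refit the Part (h) model without observation 95, then run boxCox.
library(car)
library(mosaicData)
bbqchik2 <- lm(elecbill ~ kwh + year + temp + I(temp^2), data = Utilities[-95, ])
boxCox(bbqchik2)
```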
Part (j)
State the equation for the final regression model for predicting elecbill. Be sure observation 95 is removed, i.e., data=Utilities[-95,]…
burntbbq <- lm(elecbill ~ kwh + year+ temp + I(temp^2), data = Utilities[-95,])
summary(burntbbq)
##
## Call:
## lm(formula = elecbill ~ kwh + year + temp + I(temp^2), data = Utilities[-95,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5193 -3.7173 -0.1233 3.1444 24.6028
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.773e+03 3.665e+02 -15.752 < 2e-16 ***
## kwh 9.969e-02 2.818e-03 35.380 < 2e-16 ***
## year 2.880e+00 1.829e-01 15.744 < 2e-16 ***
## temp -5.180e-01 1.529e-01 -3.389 0.000973 ***
## I(temp^2) 9.226e-03 1.632e-03 5.654 1.23e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.868 on 111 degrees of freedom
## Multiple R-squared: 0.9471, Adjusted R-squared: 0.9452
## F-statistic: 496.6 on 4 and 111 DF, p-value: < 2.2e-16
imdone <- lm(elecbill ~ year + I(month^2), data = Utilities)
summary(imdone)
##
## Call:
## lm(formula = elecbill ~ year + I(month^2), data = Utilities)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.934 -14.270 0.354 16.742 44.315
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.029e+03 1.304e+03 -6.156 1.15e-08 ***
## year 4.037e+00 6.504e-01 6.208 8.96e-09 ***
## I(month^2) 1.961e-01 4.276e-02 4.587 1.16e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.1 on 114 degrees of freedom
## Multiple R-squared: 0.3187, Adjusted R-squared: 0.3068
## F-statistic: 26.67 on 2 and 114 DF, p-value: 3.157e-10
Part 1 - Weighted Regressions
X <- c(1, 4, 7, 8, 10, 20)
Y <- c(3, 5, 18, 13, 12, 1)
w <- c(1, .5, .2, 1, 1, .1)
mylm <- lm(Y ~ X, weights=w)
plot(Y ~ X, pch=21, bg=rgb(1-w,1-w,1-w), col="orange")
abline(mylm)
Then, reproduce each of the graphics below by changing only the weights, w, and re-running the above codes of mylm, plot, and abline.
X <- c(1, 4, 7, 8, 10, 20)
Y <- c(3, 5, 18, 13, 12, 1)
# Left to right: weights for the 1st through 6th dots, affecting how much each dot pulls on the slope of the line (kept between 0 and 1 so the rgb() shading works)
w <- c(1, 1, 0, 0, 0, 0)
mylm <- lm(Y ~ X, weights=w)
plot(Y ~ X, pch=21, bg=rgb(1-w,1-w,1-w), col="orange")
abline(mylm)
X <- c(1, 4, 7, 8, 10, 20)
Y <- c(3, 5, 18, 13, 12, 1)
# Left to right: weights for the 1st through 6th dots, affecting how much each dot pulls on the slope of the line (kept between 0 and 1 so the rgb() shading works)
w <- c(0, 0, 1, 0, 0, 1)
mylm <- lm(Y ~ X, weights=w)
plot(Y ~ X, pch=21, bg=rgb(1-w,1-w,1-w), col="orange")
abline(mylm)
X <- c(1, 4, 7, 8, 10, 20)
Y <- c(3, 5, 18, 13, 12, 1)
# Left to right: weights for the 1st through 6th dots, affecting how much each dot pulls on the slope of the line (kept between 0 and 1 so the rgb() shading works)
w <- c(1, 1, 0.4, 1, 1, 0.2)
mylm <- lm(Y ~ X, weights=w)
plot(Y ~ X, pch=21, bg=rgb(1-w,1-w,1-w), col="orange")
abline(mylm)
X <- c(1, 4, 7, 8, 10, 20)
Y <- c(3, 5, 18, 13, 12, 1)
# Left to right: weights for the 1st through 6th dots, affecting how much each dot pulls on the slope of the line (kept between 0 and 1 so the rgb() shading works)
w <- c(0.3, 0.4, 1, 1, 1, 1)
mylm <- lm(Y ~ X, weights=w)
plot(Y ~ X, pch=21, bg=rgb(1-w,1-w,1-w), col="orange")
abline(mylm)
Part 2 - Loess Curves: How they Work
Change the code so that quadratic regression curves are used to fit the loess curve instead of lines.
# Get some data for X and Y
X <- cars$speed
Y <- cars$dist
# Ensure the data has no missing values in any of the X,Y pairs
keep <- which(!is.na(X) & !is.na(Y))
X <- X[keep]
Y <- Y[keep]
# Decide on the fraction of points to use for the local regressions
f <- 1/2
# Identify the total sample size and sample size corresponding to f
n <- length(X) #total sample size
nn <- floor(n*f) #number of dots corresponding to percentage f
# Create storage for the lowess fitted-values
lfit <- rep(NA,n)
# Begin the for loop to compute lowess fitted-values
for (xc in 1:n){ #let each dot be the center dot, one at a time
xdists <- X - X[xc] #compute distance from all X-values to center dot x-value
r <- sort(abs(xdists))[nn] #the nn"th" smallest x-distance: the radius of the neighborhood
xdists.nbrhd <- which(abs(xdists) < r) #identify x-values within the neighborhood
w <- rep(0, length(xdists)) #initialize the weights for all x-values at 0
w[xdists.nbrhd] <- (1 - abs(xdists[xdists.nbrhd]/r)^3)^3 #tri-cubic weight function
plot(Y ~ X, pch=21, bg=rgb(.53,.81,.92, w), #color dots by their weights
col=rgb(.2,.2,.2,.3), cex=1.5, yaxt='n', xaxt='n', xlab="", ylab="")
points(Y[xc] ~ X[xc], pch=16, col="orange") #add center dot to plot
lmc <- lm(Y ~ X, weights=w) #run weighted regression
curve(lmc$coef[1] + lmc$coef[2]*x, #draw line on neighborhood using from and to
from=min(X[xdists.nbrhd]), to=max(X[xdists.nbrhd]), col="orange", add=TRUE)
lines(lfit[1:xc] ~ X[1:xc], col="gray") #add lowess line up to current center dot
#lines(lowess(X,Y), col=rgb(0.698,0.133,0.133,.2))
cat("\n\n")
readline(prompt=paste0("Center point is point #", xc, "... Press [enter] to continue..."))
MADnotThereYet <- TRUE
count <- 0
while(MADnotThereYet){ #while loop will continue until "MadnotThereYet" becomes false
readline(prompt=paste0("\n Adjusting line to account for outliers in the y-direction... Press [enter] to continue..."))
# overwrite current regression to lighter color if new regression still needed:
curve(lmc$coef[1] + lmc$coef[2]*x, from=min(X[xdists.nbrhd]), to=max(X[xdists.nbrhd]), col="wheat", add=TRUE)
# update the weights
MAD <- median(abs(lmc$res))
resm <- lmc$res/(6*MAD)
resm[abs(resm) > 1] <- 1 #clamp large residuals (both signs) so they receive weight 0
bisq <- (1-resm^2)^2
w <- w*bisq
obs <- coef(lmc)
lmc <- lm(Y ~ X, weights=w)
curve(lmc$coef[1] + lmc$coef[2]*x, from=min(X[xdists.nbrhd]), to=max(X[xdists.nbrhd]), col="orange", add=TRUE)
#stopping criterion for the weighted regressions in y-direction
count <- count + 1
if ( (sum(abs(obs-lmc$coef))<.1) | (count > 3))
MADnotThereYet <- FALSE
}
curve(lmc$coef[1] + lmc$coef[2]*x, from=min(X[xdists.nbrhd]), to=max(X[xdists.nbrhd]), col="green", add=TRUE)
points(lmc$coef[1] + lmc$coef[2]*X[xc] ~ X[xc], pch=16, col="green")
readline(prompt=paste0("\n Use final line to get fitted value for this point... Press [enter] to continue to next point..."))
lfit[xc] <- predict(lmc, data.frame(X=X[xc]))
lines(lfit[1:xc] ~ X[1:xc], col="gray")
if (xc == n){
readline(prompt=paste0("\n Press [enter] to see actual Lowess curve..."))
lines(lowess(X,Y, f=f), col="firebrick")
legend("topleft", bty="n", legend="Actual lowess Curve using lowess(...)", col="firebrick", lty=1)
}
}
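The requested change to local quadratic fits touches only two spots in the loop: the weighted `lm(...)` gains an `I(X^2)` term, and the `curve(...)` calls draw with three coefficients. A minimal standalone sketch of those two lines (the center index 25 and full-span weights are illustrative, not from the loop):

```r
# One local weighted quadratic fit, as it would appear inside the loop.
X <- cars$speed
Y <- cars$dist
xc <- 25                                             # example center dot
r <- max(abs(X - X[xc]))                             # example neighborhood radius
w <- (1 - abs((X - X[xc])/r)^3)^3                    # tri-cubic weights
lmc <- lm(Y ~ X + I(X^2), weights = w)               # was: lm(Y ~ X, weights = w)
plot(Y ~ X, pch = 16)
curve(lmc$coef[1] + lmc$coef[2]*x + lmc$coef[3]*x^2, # was a straight line
      col = "orange", add = TRUE)
```

The same three-coefficient `curve(...)` form replaces every straight-line `curve(...)` in the loop, and the bisquare re-weighting step is unchanged.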
##
##
## Center point is point #1... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
## (the same "Adjusting line..." and "Use final line..." prompts repeat
## for center points #2 through #10)
##
##
## Center point is point #11... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #12... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #13... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #14... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #15... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #16... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #17... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #18... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #19... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #20... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #21... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #22... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #23... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #24... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #25... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #26... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #27... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #28... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #29... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #30... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #31... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #32... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #33... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #34... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #35... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #36... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #37... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #38... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #39... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #40... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #41... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #42... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #43... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #44... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #45... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #46... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #47... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #48... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #49... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
##
## Center point is point #50... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
##
## Use final line to get fitted value for this point... Press [enter] to continue to next point...
##
## Press [enter] to see actual Lowess curve...
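The curve the demo builds up point by point is what R's built-in `lowess()` computes in one call. A minimal self-contained sketch on simulated data (the data here are illustrative, not the demo's):

```r
# lowess() fits a locally weighted regression curve in one call
set.seed(42)
x <- runif(50, 0, 10)
y <- sin(x) + rnorm(50, 0, 0.3)
plot(x, y, pch = 16)
smooth <- lowess(x, y, f = 2/3, iter = 3)  # f = smoother span, iter = robustness iterations
lines(smooth, col = "red", lwd = 2)
```

Shrinking `f` makes the curve wigglier; the robustness iterations are what downweight the y-direction outliers the demo keeps adjusting for.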
set.seed(123)
x <- runif(100, 0, 10)
y_true <- 1 + 2 * x - 0.5 * x^2 + rnorm(100, 0, 2)
y_observed <- y_true
df <- data.frame(x = x, y = y_observed)
model_quad <- lm(y ~ poly(x, 2), data = df)
summary(model_quad)
##
## Call:
## lm(formula = y ~ poly(x, 2), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3911 -1.2490 -0.0851 1.1498 4.4525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.5849 0.1944 -28.73 <2e-16 ***
## poly(x, 2)1 -86.5435 1.9437 -44.52 <2e-16 ***
## poly(x, 2)2 -34.2285 1.9437 -17.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.944 on 97 degrees of freedom
## Multiple R-squared: 0.9594, Adjusted R-squared: 0.9586
## F-statistic: 1146 on 2 and 97 DF, p-value: < 2.2e-16
plot(x, y_observed, pch = 19, main = "True vs Estimated Model")
curve(1 + 2 * x - 0.5 * x^2, add = TRUE, col = "green", lwd = 2)
lines(sort(x), predict(model_quad)[order(x)], col = "red", lwd = 2)
legend("topright", legend = c("True Model", "Estimated Model"), col = c("green", "red"), lwd = 2)
set.seed(321)
train_indices <- sample(1:nrow(df), 0.7 * nrow(df))
train <- df[train_indices, ]
test <- df[-train_indices, ]
model_train <- lm(y ~ poly(x, 2), data = train)
preds <- predict(model_train, newdata = test)
mse <- mean((test$y - preds)^2)
mse
## [1] 3.522266
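The same train/test idea shows why the quadratic term is worth keeping. A self-contained sketch comparing out-of-sample MSE for a degree-1 and a degree-2 fit on data generated the same way as above (exact values depend on the seeds):

```r
# Regenerate the quadratic data and compare test MSE for linear vs quadratic fits
set.seed(123)
x <- runif(100, 0, 10)
y <- 1 + 2 * x - 0.5 * x^2 + rnorm(100, 0, 2)
df2 <- data.frame(x = x, y = y)
set.seed(321)
idx <- sample(seq_len(nrow(df2)), 0.7 * nrow(df2))
fit1 <- lm(y ~ x, data = df2[idx, ])          # underfit: misses the curvature
fit2 <- lm(y ~ poly(x, 2), data = df2[idx, ]) # matches the true model's form
mse <- function(fit) mean((df2$y[-idx] - predict(fit, newdata = df2[-idx, ]))^2)
c(linear = mse(fit1), quadratic = mse(fit2))  # the quadratic fit should be far smaller
```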
model_real <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)
summary(model_real)
##
## Call:
## lm(formula = mpg ~ wt + hp + factor(cyl), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2612 -1.0320 -0.3210 0.9281 5.3947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.84600 2.04102 17.563 2.67e-16 ***
## wt -3.18140 0.71960 -4.421 0.000144 ***
## hp -0.02312 0.01195 -1.934 0.063613 .
## factor(cyl)6 -3.35902 1.40167 -2.396 0.023747 *
## factor(cyl)8 -3.18588 2.17048 -1.468 0.153705
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.44 on 27 degrees of freedom
## Multiple R-squared: 0.8572, Adjusted R-squared: 0.8361
## F-statistic: 40.53 on 4 and 27 DF, p-value: 4.869e-11
pairs(mtcars[, c("mpg", "wt", "hp", "cyl")])
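A note on the `factor(cyl)` terms: R expands them into treatment-coded dummy variables with cyl = 4 as the baseline, so the two reported coefficients are offsets from 4-cylinder cars. A quick check (refitting the same model so the chunk stands alone):

```r
# Inspect the dummy coding behind factor(cyl) in the mtcars model
model_real <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)
coef(model_real)               # factor(cyl)6 and factor(cyl)8 are offsets from cyl = 4
contrasts(factor(mtcars$cyl))  # treatment contrasts: 4 is the reference level
```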
set.seed(100)
x <- rnorm(100)
z <- 1.5 * x - 0.75
p <- 1 / (1 + exp(-z))
y <- rbinom(100, 1, p)
log_data <- data.frame(x = x, y = factor(y))
model_logit <- glm(y ~ x, data = log_data, family = binomial)
summary(model_logit)
##
## Call:
## glm(formula = y ~ x, family = binomial, data = log_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8835 0.2788 -3.169 0.00153 **
## x 1.9267 0.4123 4.673 2.96e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 130.684 on 99 degrees of freedom
## Residual deviance: 88.394 on 98 degrees of freedom
## AIC: 92.394
##
## Number of Fisher Scoring iterations: 5
prob <- predict(model_logit, type = "response")
pred_class <- ifelse(prob > 0.5, 1, 0)
table(Predicted = pred_class, Actual = y)
## Actual
## Predicted 0 1
## 0 55 17
## 1 9 19
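The confusion matrix above summarizes into accuracy, sensitivity, and specificity. A sketch computed directly from the printed counts:

```r
# Metrics from the confusion matrix printed above (rows = predicted, cols = actual)
cm <- matrix(c(55, 9, 17, 19), nrow = 2,
             dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))
accuracy    <- sum(diag(cm)) / sum(cm)        # (55 + 19) / 100
sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # 19 / 36
specificity <- cm["0", "0"] / sum(cm[, "0"])  # 55 / 64
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

With these counts the model is right 74% of the time overall, but it catches only about 53% of the actual 1s.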
Problem 1
Part (a)
Is birthmonth enough information to predict the birthyear of a fourth grade student? Fit a logistic regression of birthyear == 88 on the birthmonth of a child to find out.
Enter the values of your estimates for \(\beta_0\) and \(\beta_1\) in the logistic model:
\[ P(\text{birthyear}_i = 88 \, | \, \text{birthmonth}_i) = \frac{e^{\beta_0 + \beta_1 \text{birthmonth}_i}}{1 + e^{\beta_0 + \beta_1 \text{birthmonth}_i}} \]
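Equivalently, on the log-odds scale the model is a straight line in birthmonth, which is what makes \(e^{\beta_1}\) interpretable as an odds ratio:

\[ \log\left(\frac{P(\text{birthyear}_i = 88 \,|\, \text{birthmonth}_i)}{1 - P(\text{birthyear}_i = 88 \,|\, \text{birthmonth}_i)}\right) = \beta_0 + \beta_1 \text{birthmonth}_i \]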
babies <- KidsFeet %>%
  mutate(year88 = case_when(
    birthyear == 88 ~ 1,
    birthyear == 87 ~ 0))
babe <- glm(year88 ~ birthmonth, data = babies, family=binomial)
summary(babe)
##
## Call:
## glm(formula = year88 ~ birthmonth, family = binomial, data = babies)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.9665 3.8307 2.602 0.00928 **
## birthmonth -0.9992 0.3977 -2.513 0.01198 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 36.708 on 38 degrees of freedom
## Residual deviance: 16.921 on 37 degrees of freedom
## AIC: 20.921
##
## Number of Fisher Scoring iterations: 7
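As a sanity check on the part (a) fit, the fitted probabilities can be reproduced by hand from the printed (rounded) estimates, so the values below differ from predict() only in later decimal places:

```r
# P(birthyear = 88) by month, from the rounded part (a) coefficients
b0 <- 9.9665
b1 <- -0.9992
month <- 1:12
p88 <- exp(b0 + b1 * month) / (1 + exp(b0 + b1 * month))
round(setNames(p88, month.abb), 4)  # drops below 0.5 at about month 10
```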
Part (b)
Fit a new model using only the width of the child’s foot instead of birthmonth to predict the birthyear of the child. Is foot width a better or worse predictor of the birthyear of the child than birthmonth?
State the AIC, null deviance, and residual deviance of both logistic regression models.
baby <- glm(year88 ~ width, data = babies, family=binomial)
summary(baby)
##
## Call:
## glm(formula = year88 ~ width, family = binomial, data = babies)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 15.5140 9.0614 1.712 0.0869 .
## width -1.5363 0.9824 -1.564 0.1179
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 36.708 on 38 degrees of freedom
## Residual deviance: 33.877 on 37 degrees of freedom
## AIC: 37.877
##
## Number of Fisher Scoring iterations: 5
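For a logistic regression on 0/1 data, AIC is just the residual deviance plus twice the number of estimated parameters, so both printed AICs can be verified by hand:

```r
# AIC = residual deviance + 2 * (number of parameters) for these binomial fits
aic_birthmonth <- 16.921 + 2 * 2  # intercept + birthmonth slope
aic_width      <- 33.877 + 2 * 2  # intercept + width slope
c(birthmonth = aic_birthmonth, width = aic_width)  # 20.921 and 37.877, matching summary()
```

By either AIC or residual deviance, birthmonth is the better predictor of birthyear than foot width.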
Part (c)
Use the better of the two models from parts (a) and (b) to predict the probability that a fourth grade boy (from the same era as the students in this data) was born in 88, given that he had a foot width of 9.5 cm, a foot length of 24 cm, and was born in May. (This is more information than you need to make the prediction.)
predict(babe, newdata = data.frame(birthmonth = 5), type = "response")  # May is month 5
## 1
## 0.9931
Part (d)
Interpret the effect that the slope term from the model used in Part (c) has on the odds of a fourth grade child being born in 88.
1-exp(-0.9992)
## [1] 0.6318261
Answer: “The odds of the child being born in 88 decrease by 63% for each one month increase in the birthmonth (1 being January and 12 being December). In other words, a child born in February has an odds of being born in 88 that is 63% less than a child that was born in January. And the same is true for any two consecutive months.”
Problem 2
Consider the RailTrail data set in library(mosaicData) (see ?RailTrail for its documentation).
As shown by the boxplot below, the number of users on the rail trail
seems to be influenced by whether or not there was any precipitation on
that day.
boxplot(volume ~ precip > 0, data = RailTrail, col = "skyblue",
        xlab = "90 Days of Records from April to November",
        ylab = "Total Number of Users on the Trail that Day",
        main = "Rain Seems to Reduce the Number of Users",
        names = c("Days with No Rain", "Rainy Days"))
The goal of this question is to identify a logistic regression model that could be used to predict if there will be rain on a given day or not.
Run the following commands to reduce the data to variables of interest for this problem.
data("RailTrail")
# dplyr::select is called explicitly because MASS (loaded above) masks select()
RT <- RailTrail %>%
  mutate(precipOccurred = (precip > 0)) %>%
  dplyr::select(precipOccurred, lowtemp, spring, summer, fall, cloudcover, weekday)
pairs(RT)
Part (a)
Create a pairs plot that shows how well each variable in the RT data set can explain whether or not precipitation occurred on a given day. State which variables seem to be the strongest predictors of precipitation occurring.
(Note that a major limitation of this data is that all measurements are on the “day of” the precipitation. The data could potentially be more insightful if it contained the temperature or other information for the day prior, or two days prior, or so on… as well.)
#Hint: pairs(..., pch=16, col=rgb(.2,.2,.2,.1))
#Or: pairs(..., pch=16, col=as.factor(yourData$yourYvariable))
The cloudcover variable looks to be the strongest single predictor of precipOccurred.
Part (b)
Fit three logistic regression models. The first should use lowtemp to predict precipitation. The second should use cloudcover to predict precipitation. The third should use fall to predict precipitation.
Compare the AIC and residual deviance of each model, which one is “better”?
cool <- glm(precipOccurred ~ lowtemp, data = RT, family=binomial)
summary(cool)
##
## Call:
## glm(formula = precipOccurred ~ lowtemp, family = binomial, data = RT)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.32489 0.96286 -2.415 0.0158 *
## lowtemp 0.03377 0.01967 1.717 0.0860 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 113.14 on 89 degrees of freedom
## Residual deviance: 110.09 on 88 degrees of freedom
## AIC: 114.09
##
## Number of Fisher Scoring iterations: 4
cloudy <- glm(precipOccurred ~ cloudcover, data = RT, family=binomial)
summary(cloudy)
##
## Call:
## glm(formula = precipOccurred ~ cloudcover, family = binomial,
## data = RT)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.8900 1.0927 -4.475 7.63e-06 ***
## cloudcover 0.6081 0.1404 4.330 1.49e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 113.136 on 89 degrees of freedom
## Residual deviance: 78.098 on 88 degrees of freedom
## AIC: 82.098
##
## Number of Fisher Scoring iterations: 5
pumpkinspice <- glm(precipOccurred ~ fall, data = RT, family=binomial)
summary(pumpkinspice)
##
## Call:
## glm(formula = precipOccurred ~ fall, family = binomial, data = RT)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.6931 0.2402 -2.886 0.0039 **
## fall -0.4055 0.7086 -0.572 0.5672
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 113.14 on 89 degrees of freedom
## Residual deviance: 112.79 on 88 degrees of freedom
## AIC: 116.79
##
## Number of Fisher Scoring iterations: 4
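Of the three models, the cloudcover model is clearly "better": its AIC (82.098) and residual deviance (78.098) are far lower than those of the lowtemp model (AIC 114.09) and the fall model (AIC 116.79). Assuming the three model objects fit above, all three AIC values can also be pulled with one call:

```r
# Lower AIC is better; cloudy (the cloudcover model) wins by a wide margin
AIC(cool, cloudy, pumpkinspice)
```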
Part (c)
Try every possible 2-variable (no interaction term) logistic regression model that uses both (1) the best variable from Part (b), and (2) each of the other variables in the RT data set, one at a time. Note the AIC of each model. The best two-variable model for this data has an AIC of 82.75. State the model.
What is the p-value of both variables in this model? Which p-value is not significant?
What is the AIC of this model? Is it better or worse than the “best” one-variable model from Part (b)?
whoscontrollingthisthang <- glm( precipOccurred ~ cloudcover + weekday, data = RT, family=binomial)
summary(whoscontrollingthisthang)
##
## Call:
## glm(formula = precipOccurred ~ cloudcover + weekday, family = binomial,
## data = RT)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.5215 1.1073 -4.083 4.44e-05 ***
## cloudcover 0.6322 0.1430 4.420 9.89e-06 ***
## weekdayTRUE -0.7471 0.6521 -1.146 0.252
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 113.136 on 89 degrees of freedom
## Residual deviance: 76.753 on 87 degrees of freedom
## AIC: 82.753
##
## Number of Fisher Scoring iterations: 5
However, the p-value for the weekday variable is not
significant in this model. It is also interesting to note that the AIC
of this model (82.753) is higher than that of the one-variable model that
uses just cloudcover (82.098) to predict precipitation occurring.
This is good evidence that the AIC helps protect against over-fitting
models.
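The search over two-variable models in Part (c) can be sketched with a short loop (assuming the RT data frame from above; variable names as in the problem):

```r
# Fit cloudcover plus each remaining variable in turn and record the AIC
others <- c("lowtemp", "spring", "summer", "fall", "weekday")
aics <- sapply(others, function(v) {
  f <- as.formula(paste("precipOccurred ~ cloudcover +", v))
  AIC(glm(f, data = RT, family = binomial))
})
sort(aics)  # the smallest AIC identifies the best two-variable model
```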
The following commands illustrate diagnostics for influential points and a robust alternative to least squares. (Here model stands for a previously fitted lm object, and y and x are its response and explanatory variables.)
cooksd <- cooks.distance(model)
plot(cooksd, type = "h", main = "Cook's Distance")
abline(h = 4 / length(y), col = "red") # common rule-of-thumb cutoff for influence
model_robust <- rlm(y ~ x) # rlm() is from library(MASS)
summary(model_robust)
##
## Call: rlm(formula = y ~ x)
## Residuals:
## Min 1Q Median 3Q Max
## -0.65018 -0.30879 -0.04683 0.27767 0.85091
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 0.3500 0.0443 7.8950
## x 0.2818 0.0436 6.4551
##
## Residual standard error: 0.4324 on 98 degrees of freedom
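To see why rlm() is useful, here is a self-contained sketch on simulated data (not the course data; the true coefficients 0.35 and 0.28 are chosen to mimic the fit above):

```r
library(MASS)
set.seed(42)
x <- runif(100)
y <- 0.35 + 0.28 * x + rnorm(100, sd = 0.4)
y[1] <- 10                 # inject one gross outlier
coef(lm(y ~ x))            # ordinary least squares is pulled toward the outlier
coef(rlm(y ~ x))           # the robust fit stays closer to the true coefficients
```

rlm() downweights high-residual observations via iteratively reweighted least squares, which is why it pairs naturally with the Cook's distance diagnostic above.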